Pattern matching

part of Tcl for Web Nerds by Hal Abelson, Philip Greenspun, and Lydia Sandon; updated July 2011
Pattern matching is important across a wide variety of Web programming tasks but most notably when looking for exceptions in user-entered data and when trying to parse information out of non-cooperating Web sites.

Tcl's pattern matching facilities test whether a given string matches a specified pattern. Patterns are described using a syntax known as regular expressions. For example, the pattern expression consisting of a single period matches any character. The pattern a..a matches any four-character string whose first and last characters are both a.

The regexp command takes a pattern, a string, and an optional match variable. It tests whether the string matches the pattern, returns 1 if there is a match and zero otherwise, and sets the match variable to the part of the string that matched the pattern:

% set something candelabra
candelabra
% regexp a..a $something match
1
% set match
abra
Patterns can also contain subpatterns (delimited by parentheses) and denote repetition. A star denotes zero or more occurrences of a pattern, so a(.*)a matches any string of at least two characters that begins and ends with the character a. Whatever matched the subpattern between the a's would be put into the first subvariable, had one been supplied; the example below passes only match, which receives the entire matching substring:
% set something candelabra
candelabra
% regexp a(.*)a $something match
1
% set match
andelabra
Note that Tcl regexp by default behaves in a greedy fashion. There are three alternative substrings of "candelabra" that match the regexp a(.*)a: "andelabra", "andela", and "abra". Tcl chose the longest substring. This is very painful when trying to pull HTML pages apart:
% set simple_case "Normal folks might say <i>et cetera</i>"
Normal folks might say <i>et cetera</i>
% regexp {<i>(.+)</i>} $simple_case match italicized_phrase
1
% set italicized_phrase
et cetera
% set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"
Pedants say <i>sui generis</i> and <i>ipso facto</i>
% regexp {<i>(.+)</i>} $some_html match italicized_phrase
1
% set italicized_phrase
sui generis</i> and <i>ipso facto
What you want is a non-greedy regexp, a standard feature of Perl and available in Tcl 8.1 and later versions.
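As a minimal sketch of the non-greedy form, assuming Tcl 8.1 or later: following the quantifier with a question mark (.+? rather than .+) asks for the shortest match instead of the longest. The variable names are the ones from the transcript above.

```tcl
# Non-greedy matching: +? takes as few characters as possible.
# Assumes Tcl 8.1 or later, where non-greedy quantifiers are available.
set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"

if {[regexp {<i>(.+?)</i>} $some_html match italicized_phrase]} {
    # italicized_phrase now stops at the first </i>
    puts $italicized_phrase
}
```

With the question mark in place, italicized_phrase holds only the first phrase, sui generis.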

Lisp systems in the 1970s included elegant ways of returning all possibilities when there were multiple matches for an expression. Java libraries, Perl, and Tcl demonstrate the progress of the field of computer science by ignoring these superior systems of decades past.

Matching Cookies From the Browser

A common problem in Web development is pulling information out of cookies that come from the client. The cookie spec in RFC 6265 mandates that multiple cookies be separated by semicolons. So you look for "the cookie name that you've been using" followed by an equals sign and then slurp up anything that follows that isn't a semicolon. Here is how to look for the value of the last_visit cookie:
regexp {last_visit=([^;]+)} $cookie match last_visit
Note the square brackets inside the regexp. The Tcl interpreter isn't trying to call a procedure because the entire regexp has been grouped with braces rather than double quotes. Square brackets denote a set of acceptable characters, and a leading ^ inverts the set, so [^;] matches any single character other than a semicolon. The plus sign after the [^;] says "one or more characters that meet the preceding spec", i.e., "one or more characters that aren't semicolons". It is distinguished from * in that there must be at least one character for a match.

If successful, the regexp command above will set the match variable with the complete matching string, starting from "last_visit=". Our code doesn't make any use of this variable but only looks at the subvar last_visit that would also have been set.

Pages that use this cookie expect an integer, and this code failed in one case where a user edited his cookies file and corrupted it so that his browser was sending several thousand bytes of garbage after the "last_visit=". A better approach might have been to limit the match to digits:

regexp {last_visit=([0-9]+)} $cookie match last_visit
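One way to package the digits-only idea is sketched below. The proc name get_last_visit, the sample cookie strings, and the extra anchoring around the cookie name are assumptions made for illustration, not code from the system described above.

```tcl
# Hypothetical helper: return the last_visit value from a Cookie header,
# or an empty string if the cookie is absent or not purely numeric.
proc get_last_visit {cookie} {
    # (^|; *) requires the name to start a cookie pair, so a cookie
    # named "my_last_visit" is not picked up by mistake.
    # (;|$) requires the digits to run to the end of the pair.
    if {[regexp {(^|; *)last_visit=([0-9]+)(;|$)} $cookie match lead last_visit]} {
        return $last_visit
    }
    return ""
}

puts [get_last_visit "user_id=4312; last_visit=1310000000; theme=dark"]
puts [get_last_visit "last_visit=garbage"]
```

Returning an empty string for corrupted input lets the calling page fall back to a default rather than handing garbage to code that expects an integer.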

Matching Into Multiple Variables

More generally, regexp allows multiple pattern variables. The pattern variables after the first are set to the substrings that matched the subpatterns. Here is an example of matching a credit card expiration date entered by a user:
% set date_typed_by_user "06/02"
06/02
% regexp {([0-9][0-9])/([0-9][0-9])} $date_typed_by_user match month year
1
% set month
06
% set year
02
% 
Each pair of parentheses corresponds to a subpattern variable.
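The pattern above will also succeed on longer input such as "106/023", because regexp only needs to find a match somewhere in the string. A hedged sketch of a stricter check follows; the proc name valid_expiration_p and the month-range test are illustrative assumptions rather than part of the original example.

```tcl
# Hypothetical validator: returns 1 for input shaped like MM/YY with a
# month between 01 and 12, and 0 otherwise.
proc valid_expiration_p {date_typed_by_user} {
    # ^ and $ force the pattern to cover the whole string.
    if {![regexp {^([0-9][0-9])/([0-9][0-9])$} $date_typed_by_user match month year]} {
        return 0
    }
    # scan with %d reads "08" as decimal 8; handing "08" directly to
    # expr would treat the leading zero as an octal prefix in Tcl 8.x.
    scan $month %d month_number
    return [expr {$month_number >= 1 && $month_number <= 12}]
}

puts [valid_expiration_p "06/02"]
puts [valid_expiration_p "13/02"]
puts [valid_expiration_p "106/023"]
```

Only the first call prints 1; the other two are rejected, one for the month and one for the extra digits.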

Full Syntax

The most general form of regexp includes optional flags as well as multiple match variables:
regexp [flags] pattern data matched_result var1 var2 ...
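Commonly used flags include -nocase, which makes the match case-insensitive; -indices, which stores the start and end positions of each match in the variables instead of the matched text; and --, which marks the end of the flags so that a pattern beginning with a hyphen is not mistaken for one. For the complete list of flags and the full regular expression syntax, see http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm, http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm, and Tcl Regular Expression Examples.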
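A brief sketch of three flags documented on the regexp manual page (-nocase, -indices, and --) may help make the general form concrete. The header string and variable names are made up for the example.

```tcl
set header "Content-Type: text/html"

# -nocase ignores case differences between the pattern and the data.
puts [regexp -nocase {content-type} $header]

# -indices stores {first last} character positions instead of substrings.
regexp -indices {text/([a-z]+)} $header match_range subtype_range
puts $match_range
puts $subtype_range

# -- marks the end of the flags, so a pattern that starts with a hyphen
# is not mistaken for a flag.
puts [regexp -- {-Type} $header]
```

The index pairs can be passed to string range to pull out the matched text later.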

Matching with substitution

It's common in Web programming to create strings by substitution. Tcl's regsub command performs substitution based on a pattern:
regsub [flags] pattern data replacements var
matches the pattern against the data. If the match succeeds, the variable named var is set to data, with various parts modified, as specified by replacements. If the match fails, var is simply set to data. The value returned by regsub is the number of replacements performed.

The flag -all specifies that every occurrence of the pattern should be replaced. Otherwise only the first occurrence is replaced. Other flags include -nocase and --, as with regexp.
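The difference between the default behavior and -all, along with the count that regsub returns, can be seen in a short sketch. The sample text and variable names are invented for illustration.

```tcl
set text "red fish, Red fish"

# Without -all only the first occurrence is replaced.
set count [regsub {red} $text "blue" first_only]
puts "$count: $first_only"

# -all replaces every occurrence; -nocase also catches "Red".
set count [regsub -all -nocase {red} $text "blue" every_one]
puts "$count: $every_one"
```

The first call reports one replacement and leaves "Red" alone; the second reports two.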

Here's an example from the banner ideas module of the ArsDigita Community System (see /doc/bannerideas.html). The goal is that each banner idea contain a linked thumbnail image. To facilitate cutting and pasting of the image HTML, we don't require that the publisher include uniform subtags within the IMG. However, we use regsub to clean up:

# turn "<img align=right hspace=5" into "<img align=left border=0 hspace=8"
regsub -nocase {align=[^ ]+} $picture_html "" without_align
regsub -nocase {hspace=[^ ]+} $without_align "" without_hspace
regsub -nocase {<img} $without_hspace {<img align=left border=0 hspace=8} final_photo_html

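In the example above, <replacements> was either the empty string "" (which deletes the matched text) or a literal string. Other replacement directives include &, which stands for the entire matched text, and \1 through \9, which stand for the text matched by the corresponding parenthesized subpattern.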
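As a small sketch of the two replacement directives described on the regsub manual page, & for the whole match and \1 through \9 for parenthesized subpatterns (the sentence and variable names are made up):

```tcl
set sentence "order 42 shipped"

# & stands for the entire matched text.
regsub {[0-9]+} $sentence {#&} with_hash
puts $with_hash

# \1 and \2 stand for the first and second parenthesized subpatterns.
regsub {([a-z]+) ([0-9]+)} $sentence {\2 \1} swapped
puts $swapped
```

The replacement strings are grouped with braces so that Tcl passes the backslashes through to regsub untouched.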

Here's another web example, which parses HTML, and replaces the comments (delineated in HTML by <!-- and -->) by the comment text, enclosed in parentheses.
% proc extract_comment_text {html} {
    regsub -all {<!--([^-]*)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}

% extract_comment_text {<!--insert the price below-->
We give the same low price to everyone: $219.99
<!--make sure to query out discount if this is one of our big customers-->}
(insert the price below)
We give the same low price to everyone: $219.99
(make sure to query out discount if this is one of our big customers)
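One limitation of the pattern above is that [^-]* stops at any hyphen, so a comment whose text contains a hyphen is left untouched. A possible variant, assuming Tcl 8.1 or later for the non-greedy quantifier, is sketched here; the proc name extract_comment_text_nongreedy is hypothetical.

```tcl
proc extract_comment_text_nongreedy {html} {
    # .*? takes as little as possible, so each match ends at the
    # first --> that follows the opening <!--
    regsub -all {<!--(.*?)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}

puts [extract_comment_text_nongreedy {<!--pre-tax price-->$219.99<!--check discount-->}]
```

Both comments are rewritten here, including the one containing "pre-tax".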
Also see http://www.tcl.tk/man/tcl8.4/TclCmd/regsub.htm

String match

Tcl provides an alternative matching mechanism that is simpler for users to understand than regular expressions. The Tcl command string match uses "GLOB-style" matching. Here is the syntax:
string match pattern data
It returns 1 if there is a match and 0 otherwise. The pattern elements available here are ?, which matches any single character; *, which matches any sequence of zero or more characters; [], which delimits a set of characters or a range; and a backslash, which makes the character after it match literally. This differs from regexp in that the pattern must match the entire string supplied:
% regexp "foo" "foobar"
1
% string match "foo" "foobar"
0
% # here's what we need to do to make the string match 
% # work like the regexp
% string match "*foo*" foobar
1
Here's an example of the character range system in use:
string match {*[0-9]*} $text
returns 1 if text contains at least one digit and 0 otherwise.
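A few more glob-style patterns, sketched with invented file names and strings. The -nocase option is taken from the string manual page for Tcl 8.4 and may be missing from much older interpreters.

```tcl
# ? matches exactly one character.
puts [string match {report-?.txt} "report-7.txt"]
puts [string match {report-?.txt} "report-12.txt"]

# A bracketed set matches one character from the set or range.
puts [string match {[A-Z][0-9]*} "B52 bomber"]

# -nocase makes the comparison case-insensitive.
puts [string match -nocase {*.JPG} "sunset.jpg"]
```

Grouping the patterns with braces keeps the Tcl interpreter from treating the square brackets as a procedure call.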

Also see http://www.tcl.tk/man/tcl8.4/TclCmd/string.htm


lsandon@alum.mit.edu

Reader's Comments

[A-z] includes characters other than A-Z and a-z (see the ASCII man page). I ran a test using tclsh 8.3:

% set x ^
^
% regsub {[A-z]} $x {OOPS} x2
1
% puts $x2
OOPS


-- Paul Takemura, January 29, 2005