Pattern matching

part of Tcl for Web Nerds by Hal Abelson, Philip Greenspun, and Lydia Sandon; updated July 2011

Pattern matching is important across a wide variety of Web programming tasks but most notably when looking for exceptions in user-entered data and when trying to parse information out of non-cooperating Web sites.

Tcl's pattern matching facilities test whether a given string matches a specified pattern. Patterns are described using a syntax known as regular expressions. For example, the pattern expression consisting of a single period matches any character. The pattern a..a matches any four-character string whose first and last characters are both a.

The regexp command takes a pattern, a string, and an optional match variable. It tests whether the string matches the pattern, returns 1 if there is a match and zero otherwise, and sets the match variable to the part of the string that matched the pattern:

% set something candelabra
candelabra
% regexp a..a $something match
1
% set match
abra

Patterns can also contain subpatterns (delimited by parentheses) and denote repetition. A star denotes zero or more occurrences of a pattern, so a(.*)a matches any string of at least two characters that begins and ends with the character a. Whatever has matched the subpattern between the a's will get put into the first subvariable, if you supply one after the match variable. The match variable itself still receives the entire matching substring:

% set something candelabra
candelabra
% regexp a(.*)a $something match
1
% set match
andelabra
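
The transcript above only inspects the match variable. Here is a minimal sketch of capturing the subpattern as well; the variable name inner is our own choice for illustration, and the pattern is wrapped in braces as a precaution:

```tcl
set something candelabra

# match receives the whole matching substring;
# inner receives whatever the parenthesized (.*) matched.
regexp {a(.*)a} $something match inner

puts $match   ;# andelabra
puts $inner   ;# ndelabr
```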

Note that Tcl regexp by default behaves in a greedy fashion. There are three alternative substrings of "candelabra" that match the regexp a(.*)a: "andelabra", "andela", and "abra". Tcl chose the longest substring. This is very painful when trying to pull HTML pages apart:

% set simple_case "Normal folks might say <i>et cetera</i>"
Normal folks might say <i>et cetera</i>
% regexp {<i>(.+)</i>} $simple_case match italicized_phrase
1
% set italicized_phrase
et cetera
% set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"
Pedants say <i>sui generis</i> and <i>ipso facto</i>
% regexp {<i>(.+)</i>} $some_html match italicized_phrase
1
% set italicized_phrase
sui generis</i> and <i>ipso facto

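What you want is a non-greedy regexp, a standard feature of Perl and an option in Tcl 8.1 and later versions, where following a quantifier with a question mark (as in .+? or .*?) makes it match as little as possible.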
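
As a sketch of what that looks like with the pedantic example above, assuming Tcl 8.1 or later for the non-greedy quantifier and a somewhat newer Tcl for the -all and -inline flags (the phrases variable is ours):

```tcl
set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"

# .+? is the non-greedy form of .+ : it stops at the first </i> it can reach.
regexp {<i>(.+?)</i>} $some_html match italicized_phrase
puts $italicized_phrase   ;# sui generis

# -all -inline returns the successive, non-overlapping matches as a flat list:
# whole match, submatch, whole match, submatch, ...
set phrases {}
foreach {whole phrase} [regexp -all -inline {<i>(.+?)</i>} $some_html] {
    lappend phrases $phrase
}
puts $phrases
```

Note that -all walks through the string from left to right; it does not enumerate every alternative way a single pattern could have matched.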

Lisp systems in the 1970s included elegant ways of returning all possibilities when there were multiple matches for an expression. Java libraries, Perl, and Tcl demonstrate the progress of the field of computer science by ignoring these superior systems of decades past.

Matching Cookies From the Browser

A common problem in Web development is pulling information out of cookies that come from the client. The cookie spec in RFC 6265 mandates that multiple cookies be separated by semicolons. So you look for "the cookie name that you've been using" followed by an equals sign and then slurp up anything that follows that isn't a semicolon. Here is how to look for the value of the last_visit cookie:

regexp {last_visit=([^;]+)} $cookie match last_visit

Note the square brackets inside the regexp. The Tcl interpreter isn't trying to call a procedure because the entire regexp has been grouped with braces rather than double quotes. Square brackets denote a range of acceptable characters:

[A-Z] would match any uppercase letter
[ABC] would match any of first three characters in the alphabet (uppercase only)
[^ABC] would match any character other than the first three uppercase characters in the alphabet, i.e., the ^ reverses the sense of the brackets

The plus sign after the [^;] says "one or more characters that meet the preceding spec", i.e., "one or more characters that aren't semicolons". It is distinguished from * in that there must be at least one character for a match.

If successful, the regexp command above will set the match variable with the complete matching string, starting from "last_visit=". Our code doesn't make any use of this variable but only looks at the subvar last_visit that would also have been set.

Pages that use this cookie expect an integer, and this code failed in one case where a user edited his cookies file and corrupted it so that his browser was sending several thousand bytes of garbage after the "last_visit=". A better approach might have been to limit the match to digits:

regexp {last_visit=([0-9]+)} $cookie match last_visit
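
Since regexp leaves its variables untouched when nothing matches, it pays to check the return value. Here is a minimal sketch of a wrapper; the proc name get_last_visit, its default argument, and the sample cookie strings are hypothetical, not part of any toolkit:

```tcl
# Hypothetical helper: returns the last_visit value from a Cookie header,
# or the supplied default when the cookie is missing or not numeric.
proc get_last_visit {cookie {default 0}} {
    # (^|; *) requires the name to sit at the start of the header or right
    # after a separator, so a cookie named "my_last_visit" is not picked up.
    if {[regexp {(^|; *)last_visit=([0-9]+)} $cookie match sep last_visit]} {
        return $last_visit
    }
    return $default
}

puts [get_last_visit "user_id=42; last_visit=1311638400; theme=dark"]
puts [get_last_visit "my_last_visit=7; theme=dark"]
```

The first call should print the digits after last_visit= and the second should fall back to the default of 0.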

Matching Into Multiple Variables

More generally regexp allows multiple pattern variables. The pattern variables after the first are set to the substrings that matched the subpatterns. Here is an example of matching a credit card expiration date entered by a user:

% set date_typed_by_user "06/02"
06/02
% regexp {([0-9][0-9])/([0-9][0-9])} $date_typed_by_user match month year
1
% set month
06
% set year
02
%

Each pair of parentheses corresponds to a subpattern variable.
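
The pattern above will happily find "06/02" inside a longer string such as "106/023". Here is a sketch of a stricter check that anchors the pattern and restricts the month to 01 through 12; the proc name parse_expiration and its calling convention are assumptions made for illustration:

```tcl
# Hypothetical helper: returns 1 and sets the caller's month and year
# variables when the input is exactly MM/YY with a month from 01 to 12.
# Returns 0 (leaving the variables alone) otherwise.
proc parse_expiration {input month_var year_var} {
    upvar 1 $month_var month $year_var year
    # ^ and $ force the whole string to match, not just a piece of it.
    if {![regexp {^(0[1-9]|1[0-2])/([0-9][0-9])$} $input match month year]} {
        return 0
    }
    return 1
}

if {[parse_expiration "06/02" month year]} {
    puts "month=$month year=$year"
}
puts [parse_expiration "13/02" month year]     ;# 0: no such month
puts [parse_expiration "106/023" month year]   ;# 0: extra characters
```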

Full Syntax

The most general form of regexp includes optional flags as well as multiple match variables:

regexp [flags] pattern data matched_result var1 var2 ...

The various flags are

-nocase
the match ignores case, i.e., uppercase and lowercase letters are treated as equivalent. (In Tcl 8.1 and later this applies to the pattern as well as the data, so the pattern no longer has to be written in all lowercase.)
-indices
the match variables are set to two-element lists holding the indices of the first and last characters of the matched substring, rather than to the strings themselves.
--
marks the end of the flags; use it when your pattern begins with a -
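
Here is a short sketch of the three flags in action, using a made-up HTTP header line as the data:

```tcl
set data "Content-Type: text/html"

# -nocase: the lowercase pattern still matches the mixed-case header name.
puts [regexp -nocase {content-type} $data]

# -indices: match is set to the start and end positions of the match
# instead of the matched text.
regexp -indices {text/[a-z]+} $data match
puts $match
puts [string range $data [lindex $match 0] [lindex $match 1]]

# --: ends the flags so that a pattern starting with - is not taken for one.
puts [regexp -- {-Type} $data]
```

The index pair can be fed straight to string range, as above, to recover the matched text.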

Regular expression syntax is:

.
matches any character.
*
matches zero or more instances of the previous pattern item.
+
matches one or more instances of the previous pattern item.
?
matches zero or one instances of the previous pattern item.
|
disjunction, e.g., (a|b) matches an a or a b
( )
groups a sub-pattern.
[ ]
delimits a set of characters. ASCII ranges are specified using hyphens, e.g., [A-Za-z] matches any ASCII letter. Beware of writing [A-z]: it runs from uppercase A through lowercase z in ASCII order and therefore also matches the six characters [ \ ] ^ _ ` that sit between Z and a. If the first character in the set is ^, this complements the set, e.g., [^A-Za-z] matches any character that is not an ASCII letter.
^
Matches only when the pattern appears at the beginning of the string. The ^ must appear at the beginning of the pattern expression.
$
Matches only when the pattern appears at the end of the string. The $ must appear last in the pattern expression.
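
A few of these items are easier to see than to read about. The following sketch exercises ?, |, the anchors, and a character set; the sample strings are arbitrary:

```tcl
# ? makes the preceding item optional: "color" and "colour" both match.
puts [regexp {^colou?r$} "colour"]

# | chooses between alternatives inside a group.
puts [regexp {^(gif|jpe?g|png)$} "jpeg"]

# Without anchors a matching substring is enough; with ^ and $ the
# whole string has to fit the pattern.
puts [regexp {[0-9]+} "abc123"]
puts [regexp {^[0-9]+$} "abc123"]

# [A-Za-z] accepts ASCII letters only, so the ^ in "T^l" is rejected.
puts [regexp {^[A-Za-z]+$} "Tcl"]
puts [regexp {^[A-Za-z]+$} "T^l"]
```

Putting the patterns in braces keeps Tcl from trying to substitute the $ or evaluate the square brackets before regexp sees them.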

Also see http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm, http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm, and Tcl Regular Expression Examples.

Matching with substitution

It's common in Web programming to create strings by substitution. Tcl's regsub command performs substitution based on a pattern:

regsub [flags] pattern data replacements var

matches the pattern against the data. If the match succeeds, the variable named var is set to data, with various parts modified, as specified by replacements. If the match fails, var is simply set to data. The value returned by regsub is the number of replacements performed.

The flag -all specifies that every occurrence of the pattern should be replaced. Otherwise only the first occurrence is replaced. Other flags include -nocase and --, as with regexp.

Here's an example from the banner ideas module of the ArsDigita Community System (see /doc/bannerideas.html). The goal is that each banner idea contain a linked thumbnail image. To facilitate cutting and pasting of the image HTML, we don't require that the publisher include uniform subtags within the IMG. However, we use regsub to clean up:

# turn "<img align=right hspace=5" into "<img align=left border=0 hspace=8"
regsub -nocase {align=[^ ]+} $picture_html "" without_align
regsub -nocase {hspace=[^ ]+} $without_align "" without_hspace
regsub -nocase {<img} $without_hspace {<img align=left border=0 hspace=8} final_photo_html

In the example above, the replacements argument was either the empty string "" (which deletes the matched text) or a literal string. Other replacement directives include:

& inserts the string that matched the pattern
The backslashed numbers \1 through \9 insert the strings that matched the corresponding sub-patterns in the pattern.
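
To make the two directives concrete, here is a small sketch on an invented line of text; the variable names bolded and swapped are ours:

```tcl
set text "Call 555-1234 or 555-9876"

# & stands for the whole match: wrap each phone number in <b> tags.
regsub -all {[0-9]+-[0-9]+} $text {<b>&</b>} bolded
puts $bolded

# \1 and \2 stand for the first and second subpatterns: swap the halves.
regsub -all {([0-9]+)-([0-9]+)} $text {\2-\1} swapped
puts $swapped
```

The replacement strings are in braces so that Tcl passes the backslashes through to regsub untouched.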

Here's another web example, which parses HTML and replaces the comments (delimited in HTML by <!-- and -->) with the comment text, enclosed in parentheses.

% proc extract_comment_text {html} {
    regsub -all {<!--([^-]*)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}

% extract_comment_text {<!--insert the price below-->
We give the same low price to everyone: $219.99
<!--make sure to query out discount if this is one of our big customers-->}
(insert the price below)
We give the same low price to everyone: $219.99
(make sure to query out discount if this is one of our big customers)
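
One caveat: because the subpattern is [^-]*, a comment that itself contains a hyphen is left untouched. Here is a sketch of a variant that relies on a non-greedy match instead, assuming Tcl 8.1 or later; the proc name extract_comment_text_nongreedy and the sample input are made up for this illustration:

```tcl
proc extract_comment_text_nongreedy {html} {
    # .*? matches as little as possible, so each comment ends at its own -->
    # even when the comment text contains hyphens.
    regsub -all {<!--(.*?)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}

puts [extract_comment_text_nongreedy {<!--query the e-commerce tables-->Price: $219.99<!--done-->}]
```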

Also see http://www.tcl.tk/man/tcl8.4/TclCmd/regsub.htm

String match

Tcl provides an alternative matching mechanism that is simpler for users to understand than regular expressions. The Tcl command string match uses "GLOB-style" matching. Here is the syntax:

string match pattern data

It returns 1 if there is a match and 0 otherwise. The only pattern elements permitted here are ?, which matches any single character; *, which matches any sequence; and [], which delimits a set of characters or a range. This differs from regexp in that the pattern must match the entire string supplied:

% regexp "foo" "foobar"
1
% string match "foo" "foobar"
0
% # here's what we need to do to make the string match 
% # work like the regexp
% string match "*foo*" foobar
1

Here's an example of the character range system in use:

string match {*[0-9]*} $text

returns 1 if text contains at least one digit and 0 otherwise.
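
Here is a brief sketch of each pattern element, with arbitrary file names as the data:

```tcl
# ? matches exactly one character.
puts [string match {photo?.jpg} "photo7.jpg"]    ;# 1
puts [string match {photo?.jpg} "photo12.jpg"]   ;# 0, two characters where one is allowed

# * matches any run of characters, including none.
puts [string match {*.html} "index.html"]        ;# 1

# [] matches one character from a set or range.
puts [string match {[A-Z]*} "Tcl"]               ;# 1
puts [string match {[A-Z]*} "tcl"]               ;# 0, first character is lowercase
```

As with regexp, bracing the pattern stops Tcl from treating the square brackets as a command substitution.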

Also see http://www.tcl.tk/man/tcl8.4/TclCmd/string.htm

lsandon@alum.mit.edu

Reader's Comments

[A-z] includes characters other than A-Z and a-z (see the ASCII man page). I ran a test using tclsh 8.3:
% set x ^
^
% regsub {[A-z]} $x {OOPS} x2
1
% puts $x2
OOPS
-- Paul Takemura, January 29, 2005
