Snag Mail Regular Expressions

A snag mail grabs a portion of a web page and turns it into a single story mailing in Informz. One feature of snag mail is the ability to use regular expressions to further query and limit what's "snagged" from the web page.

Regular expressions are almost a language of their own, but users familiar with Perl will feel right at home. The pattern elements used to build regular expressions fall into the categories described below:

Position Matching

Position matching uses ^ and $ to anchor a pattern to the beginning or end of a string. Setting the pattern property to "^Informz" successfully matches "Informz is cool." but fails to match "I like Informz."

Symbol  Function
^       Only match the beginning of a string.
        "^A" matches first "A" in "An A+ for Anita."
$       Only match the ending of a string.
        "t$" matches the last "t" in "A cat in the hat"
\b      Matches any word boundary
        "ly\b" matches "ly" in "possibly tomorrow."
\B      Matches any non-word boundary
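
As a rough illustration of the position symbols above, the following sketch uses Python's re module. That choice is an assumption made for demonstration only: Snag Mail's own regular expression engine is not Python, although these four symbols behave the same way in both. The sample string "lyric" is invented here to show \B.

```python
import re

# "^Informz" matches only when the string starts with "Informz".
starts_with_informz = re.compile(r"^Informz")
print(bool(starts_with_informz.search("Informz is cool.")))  # True
print(bool(starts_with_informz.search("I like Informz.")))   # False

# "t$" matches only the "t" at the very end of the string.
last_t = re.search(r"t$", "A cat in the hat")
print(last_t.start())  # 15, the index of the final character

# "ly\b" needs a word boundary immediately after "ly".
print(re.search(r"ly\b", "possibly tomorrow.").span())  # (6, 8)

# "ly\B" needs the opposite: "ly" followed by another word character.
print(re.search(r"ly\B", "possibly tomorrow."))  # None
print(re.search(r"ly\B", "lyric").group())       # ly
```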

Literals

Literals represent alphanumeric characters. Since some characters have special meanings, they must be “escaped.” To match these special characters, precede them with a "\" in a regular expression.

Symbol        Function
Alphanumeric  Matches alphabetical and numerical characters literally.
\n            Matches a new line
\f            Matches a form feed
\r            Matches carriage return
\t            Matches horizontal tab
\v            Matches vertical tab
\?            Matches ?
\*            Matches *
\+            Matches +
\.            Matches .
\|            Matches |
\{            Matches {
\}            Matches }
\\            Matches \
\[            Matches [
\]            Matches ]
\(            Matches (
\)            Matches )
\xxx          Matches the ASCII character expressed by the octal number xxx.
              "\50" matches "(" or chr(40).
\xdd          Matches the ASCII character expressed by the hex number dd.
              "\x28" matches "(" or chr(40).
\uxxxx        Matches the Unicode character expressed by the hex number xxxx.
              "\u00A3" matches "£".

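The escapes above can be tried out with a short sketch. It assumes Python's re module purely for illustration, and the sample text is invented. One difference worth noting: Python reads "\50" as a reference to group 50, so the octal escape is written with three digits ("\050") in this sketch.

```python
import re

text = "Cost (approx.): £5 + tax?"

# Escaped special characters match themselves and nothing else.
print(re.findall(r"\.", text))  # ['.']
print(re.findall(r"\?", text))  # ['?']
print(re.findall(r"\+", text))  # ['+']

# Three ways to match "(": escaped literal, hex code, octal code.
paren_patterns = [r"\(", r"\x28", r"\050"]
paren_positions = [re.search(p, text).start() for p in paren_patterns]
print(paren_positions)  # [5, 5, 5]

# A Unicode escape for the pound sign.
print(re.search(r"\u00A3", text).group())  # £

# re.escape adds the backslashes for you when the text to match
# comes from elsewhere and may contain special characters.
raw = "1+1=2?"
print(bool(re.fullmatch(re.escape(raw), raw)))  # True
```
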
Character Classes

Character classes enable customized grouping by enclosing a set of characters in square brackets, []. Placing ^ as the first character inside the [] creates a negated character class, and a dash specifies a range of characters. For example, the regular expression "[^a-zA-Z0-9]" matches everything except alphanumeric characters. Some common character sets are also available as a shorthand: an escape plus a letter.

Symbol  Function
[xyz]   Match any one character enclosed in the character set.
        "[a-e]" matches "b" in "basketball".
[^xyz]  Match any one character not enclosed in the character set.
        "[^a-e]" matches "s" in "basketball".
.       Match any character except \n.
\w      Match any word character. Equivalent to [a-zA-Z_0-9].
\W      Match any non-word character. Equivalent to [^a-zA-Z_0-9].
\d      Match any digit. Equivalent to [0-9].
\D      Match any non-digit. Equivalent to [^0-9].
\s      Match any space character. Equivalent to [ \t\r\n\v\f].
\S      Match any non-space character. Equivalent to [^ \t\r\n\v\f].
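
The character classes above can be checked with a small sketch, again assuming Python's re module for illustration only. In Python the shorthand classes also match non-ASCII letters and digits by default, so the re.ASCII flag is passed to get the equivalences listed in the table. The sample strings other than "basketball" are invented.

```python
import re

word = "basketball"
print(re.search(r"[a-e]", word).group())   # b (first character in a-e)
print(re.search(r"[^a-e]", word).group())  # s (first character outside a-e)

# The negated class from the text: everything except alphanumerics.
non_alnum = re.findall(r"[^a-zA-Z0-9]", "Snag-mail #42!")
print(non_alnum)  # ['-', ' ', '#', '!']

sample = "Order 66, aisle_7\tnow"
digits = re.findall(r"\d", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)
print(digits)  # ['6', '6', '7']
print(words)   # ['Order', '66', 'aisle_7', 'now']
print(spaces)  # [' ', ' ', '\t']
```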

Repetition

Repetition matching specifies the number of times an element of a regular expression may be repeated.

Symbol  Function
{x}     Match exactly x occurrences of a regular expression.
        "\d{5}" matches 5 digits.
{x,}    Match x or more occurrences of a regular expression.
        "\s{2,}" matches at least 2 space characters.
{x,y}   Matches x to y number of occurrences of a regular expression.
        "\d{2,3}" matches at least 2 but no more than 3 digits.
?       Match zero or one occurrences. Equivalent to {0,1}.
        "a\s?b" matches "ab" or "a b".
*       Match zero or more occurrences. Equivalent to {0,}.
+       Match one or more occurrences. Equivalent to {1,}.
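
A brief sketch of the repetition symbols, assuming Python's re module for illustration only; the sample strings are invented. Note that repetition is greedy, so "\d{2,3}" takes three digits when three are available.

```python
import re

# {x}: exactly five digits.
five_digits = re.compile(r"\d{5}")
print(bool(five_digits.fullmatch("12207")))  # True
print(bool(five_digits.fullmatch("1220")))   # False

# {x,}: collapse every run of two or more spaces into one space.
collapsed = re.sub(r"\s{2,}", " ", "too    many   spaces")
print(collapsed)  # too many spaces

# {x,y}: two or three digits; a lone digit is skipped.
two_or_three = re.findall(r"\d{2,3}", "7 42 123 4567")
print(two_or_three)  # ['42', '123', '456']

# ?: the space between "a" and "b" is optional, but only one is allowed.
optional_space = re.compile(r"a\s?b")
space_results = [bool(optional_space.fullmatch(s)) for s in ["ab", "a b", "a  b"]]
print(space_results)  # [True, True, False]

# * accepts zero occurrences, + requires at least one.
print(bool(re.fullmatch(r"ab*", "a")))  # True
print(bool(re.fullmatch(r"ab+", "a")))  # False
```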

Alternation and Grouping

Alternation and grouping are used to develop more complex regular expressions. Using alternation and grouping techniques can create intricate clauses within a regular expression, and offer more flexibility and control.

Symbol  Function
()      Groups part of an expression into a single clause. May be nested.
        "(ab)?(c)" matches "abc" or "c".
|       Alternation combines clauses into one regular expression and then matches any of the individual clauses.
        "(ab)|(cd)|(ef)" matches "ab" or "cd" or "ef".

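The two examples above can be exercised with a short sketch, assuming Python's re module for illustration only. The input "ab-cd-ef-gh" is invented. When an optional group does not take part in a match, Python reports it as None.

```python
import re

# "(ab)?(c)": the first group is optional, the second is required.
optional_prefix = re.compile(r"(ab)?(c)")
groups_abc = optional_prefix.fullmatch("abc").groups()
groups_c = optional_prefix.fullmatch("c").groups()
print(groups_abc)  # ('ab', 'c')
print(groups_c)    # (None, 'c')

# "(ab)|(cd)|(ef)": any one of the three clauses may match.
alternation = re.compile(r"(ab)|(cd)|(ef)")
found = [m.group() for m in alternation.finditer("ab-cd-ef-gh")]
print(found)  # ['ab', 'cd', 'ef']
```
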
Backreferences

Backreferences enable the programmer to refer back to a portion of the regular expression. This is done by use of parentheses and the backslash (\) character followed by a single digit. The first parenthesized clause is referred to by \1, the second by \2, etc.

Symbol  Function
()\n    Matches a clause as numbered by the left parenthesis
        "(\w+)\s+\1" matches any word that occurs twice in a row, such as "hubba hubba."

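The backreference example above can be sketched as follows, assuming Python's re module for illustration only; the sample sentences are invented. The sketch also shows a caveat: without word boundaries, the pattern can match a repeated fragment such as "is is" inside "this is". Adding \b on both sides is one possible way to require whole words.

```python
import re

doubled = re.compile(r"(\w+)\s+\1")

match = doubled.search("She said hubba hubba twice.")
print(match.group())   # hubba hubba
print(match.group(1))  # hubba (the text captured by the first clause)

# Caveat: the tail of "this" plus the word "is" also counts as a repeat.
fragment = doubled.search("this is fine")
print(fragment.group())  # is is

# Requiring word boundaries restricts the match to whole words.
whole_words = re.compile(r"\b(\w+)\s+\1\b")
print(whole_words.search("this is fine"))  # None
print(whole_words.search("She said hubba hubba twice.").group())  # hubba hubba
```
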
Examples

  • "^\s*((\$\s?)|(£\s?))?((\d+(\.(\d\d)?)?)|(\.\d\d))\s*(UK|GBP|GB|USA|US|USD)?\s*$"
  • "^\s*…" and "…\s*$" - means that there can be any number of leading and trailing space characters, and the input must be on a line by itself
  • "((\$\s?)|(£\s?))?" - means an optional $ or £ sign followed by an optional space
  • "((\d+(\.(\d\d)?)?)|(\.\d\d))" - matches either at least one digit, optionally followed by a decimal point and an optional pair of digits, or a decimal point followed by two digits. This means that input such as 6., 23.33, and .88 is allowed, but 5.5 is not.
  • "\s*(UK|GBP|GB|USA|US|USD)?" - means that any number of space characters are valid, followed by an optional currency code from the list.

In this example, regular expressions are used to determine whether the user entered US dollars or British pounds. Search the input for the strings £, UK, GBP, or GB. If the search succeeds, the user has entered an amount in British pounds. Otherwise, assume USD currency.
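
The approach described above might be sketched as follows, assuming Python's re module for illustration only. The pattern is written with balanced parentheses, and the function name classify_amount is hypothetical, not part of Snag Mail. The function first validates the input against the full pattern, then searches it for a pound indicator.

```python
import re

# Full validation pattern: optional currency sign, amount, optional code.
AMOUNT = re.compile(
    r"^\s*((\$\s?)|(£\s?))?((\d+(\.(\d\d)?)?)|(\.\d\d))"
    r"\s*(UK|GBP|GB|USA|US|USD)?\s*$"
)

# Any of these strings marks the amount as British pounds.
POUNDS = re.compile(r"£|UK|GBP|GB")


def classify_amount(entry):
    """Return "GBP" or "USD" for valid input, or None if it is invalid."""
    if not AMOUNT.match(entry):
        return None
    return "GBP" if POUNDS.search(entry) else "USD"


for entry in ["£ 23.33", "6. USD", ".88 GBP", "$10", "10 UK", "5.5"]:
    print(repr(entry), classify_amount(entry))
```

With these inputs, "£ 23.33", ".88 GBP", and "10 UK" are classified as pounds, "6. USD" and "$10" as dollars, and "5.5" is rejected because a decimal point must be followed by either no digits or exactly two.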