Regular Expression Pattern Matching

A regular expression is a pattern that describes text, or a string of text. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.

There are several different versions of regular expression syntax:

  • Basic regular expressions are commonly used by the UNIX grep and sed tools.
  • Extended regular expressions add operators such as +, ?, and |, and use plain (unescaped) parentheses for expression grouping.
  • Perl has its own variant of regular expressions, sometimes simply referred to as "perl regex".
  • Shell expressions are not really regular expressions, but may often look similar.

Regular Expression Fundamentals

The most basic building blocks of regular expressions are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Metacharacters, such as . and *, carry a special meaning instead of matching themselves, which makes it possible to describe a whole class of strings with a single pattern.

A list of characters enclosed by [ and ] matches any single character in that list; if the first character in the list is the caret ^ then the regular expression matches any single character NOT in the list.

Example: the regular expression [0123456789] matches any single digit, while the regular expression [^0123456789] matches any character BUT a digit.

You can also specify a lexicographic range of characters as found in the current character set (usually UTF-8 or ASCII) by separating the first and last characters in the range with a hyphen. For example, [a-d] is equivalent to [abcd].

To include a literal ] place it first in the list. To include a literal ^ place it anywhere but first in the list. Finally, to include a literal -, simply place it last.
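The bracket-list rules above can be tried out in a short sketch. It assumes Python's re module, which follows the PERL style of syntax; the variable names and the sample string are made up for illustration.

```python
import re

# A bracket list matches exactly one character taken from the list.
digit = re.compile(r"[0123456789]")
non_digit = re.compile(r"[^0123456789]")

# A range is shorthand for the same list of digits.
digit_range = re.compile(r"[0-9]")

# Literal ] placed first, literal ^ placed anywhere but first, literal - placed last.
special = re.compile(r"[]^-]")

sample = "a7]^-"  # hypothetical input
digits_found = digit.findall(sample)          # ['7']
non_digits_found = non_digit.findall(sample)  # ['a', ']', '^', '-']
range_found = digit_range.findall(sample)     # ['7']
special_found = special.findall(sample)       # [']', '^', '-']
```

Other regular expression libraries may treat the placement of ], ^ and - slightly differently, so it is worth checking the documentation of the tool in use.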

The period . will match any single character. Most regular expression libraries provide shortcuts to specify certain kinds of characters:

  • \w OR [[:alnum:]]
    PERL uses \w, and GNU regex supports both.
    [[:alnum:]] matches any letter or digit: [A-Za-z0-9]. \w additionally matches the underscore: [A-Za-z0-9_].
  • \d OR [[:digit:]]
    PERL uses \d, and GNU regex only supports [[:digit:]].
    This shortcut matches any digit.
  • [[:alpha:]]
    GNU regex extension that matches any letter.
  • \s
    PERL regex that matches any whitespace character.
  • \W
    PERL and GNU regex that matches anything but a "word character": [^A-Za-z0-9_].
  • \D
    PERL regex that matches anything that is not a digit.
  • \S
    PERL regex that matches anything that is not a whitespace character.

Additional shortcuts are available, but these are the most commonly used.
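As a rough illustration of the PERL-style shortcuts, here is a minimal sketch using Python's re module (an assumption; Python does not accept the [[:digit:]] style of class). The re.ASCII flag keeps the shortcuts limited to ASCII characters, and in Python \w also includes the underscore. The sample text is invented.

```python
import re

text = "Order 66, aisle_7!"  # hypothetical input

# re.ASCII restricts \w, \d and \s to their ASCII meaning.
digits = re.findall(r"\d", text, re.ASCII)       # ['6', '6', '7']
words = re.findall(r"\w+", text, re.ASCII)       # ['Order', '66', 'aisle_7']
spaces = re.findall(r"\s", text, re.ASCII)       # [' ', ' ']
non_word = re.findall(r"\W", text, re.ASCII)     # [' ', ',', ' ', '!']
non_digits = re.findall(r"\D+", text, re.ASCII)  # ['Order ', ', aisle_', '!']
non_spaces = re.findall(r"\S+", text, re.ASCII)  # ['Order', '66,', 'aisle_7!']
```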

Regular Expression Anchors

Normally, regular expressions scan the input for the pattern. However, it commonly occurs that you know that the pattern will appear at the beginning or end of the text.

This is called anchoring, and you can anchor the pattern using ^ to match the beginning of text or $ to match the end of text.

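It is important to note that in Basic regular expressions ^ and $ act as anchors only at the beginning and end of the pattern, respectively; elsewhere they match themselves. In Extended and PERL regular expressions they act as anchors wherever they appear outside a bracket list.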

The anchor \< matches the empty string at the beginning of a word; the anchor \> matches the empty string at the end of a word. PERL does not support \< and \>. Instead it provides \b, which matches the empty string at either edge of a word (beginning or end), and \B, which matches the empty string anywhere that is NOT the edge of a word. GNU regex supports \b and \B as well. Note that \e and \E are not anchors in PERL: \e is the escape character, and \E ends a \Q, \L, or \U sequence.
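The anchors can be sketched with Python's re module, chosen here as an assumption because it follows the PERL style. Python does not support \< and \>, so only ^, $, \b and \B are shown; the sample line is invented.

```python
import re

line = "cat concat catalog"  # hypothetical input

# ^ and $ tie the pattern to the start and end of the text.
starts_with_cat = re.search(r"^cat", line) is not None  # True
ends_with_cat = re.search(r"cat$", line) is not None    # False

# \b matches at a word edge, \B matches anywhere that is not a word edge.
whole_word = re.findall(r"\bcat\b", line)                        # ['cat']
word_start = [m.start() for m in re.finditer(r"\bcat", line)]    # [0, 11]
inside_word = [m.start() for m in re.finditer(r"\Bcat", line)]   # [7]
```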

Repetition Operators

It can be tedious to repeat a complicated regular expression. Sometimes the string being searched for has an unknown length. In such cases the repetition operators are used.

The repetition operators appear AFTER a regular expression, for example \w+ matches a word of any length.

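  • ? The preceding item is optional, and will be matched zero or one time.
  • * The preceding item will be matched zero or more times.
  • + The preceding item will be matched one or more times.
  • {num} The preceding item will be matched exactly num times.
  • {num,} The preceding item will be matched num or more times.
  • {,num} The preceding item will be matched no more than num times. This form is a GNU extension and is not accepted by every library.
  • {num1,num2} The preceding item will be matched at least num1 times, but no more than num2 times.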
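Each repetition operator in the list above can be checked with a small sketch. It assumes Python's re module, which happens to accept all seven forms including {,num}; the helper name full and the test strings are hypothetical.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

optional = [full(r"colou?r", s) for s in ("color", "colour", "colouur")]
star = [full(r"ab*", s) for s in ("a", "abbb", "b")]
plus = [full(r"ab+", s) for s in ("a", "ab", "abbb")]
exact = [full(r"\d{4}", s) for s in ("2024", "202", "20245")]
at_least = [full(r"\d{2,}", s) for s in ("1", "12", "12345")]
at_most = [full(r"\d{,2}", s) for s in ("", "12", "123")]
between = [full(r"\d{2,3}", s) for s in ("1", "12", "123", "1234")]

# optional  -> [True, True, False]
# star      -> [True, True, False]
# plus      -> [False, True, True]
# exact     -> [True, False, False]
# at_least  -> [False, True, True]
# at_most   -> [True, True, False]
# between   -> [False, True, True, False]
```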

Regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.

Regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.

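Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules. Separately, the repetition operators are greedy: each one matches as much text as it can while still allowing the rest of the pattern to match.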
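A short sketch may make the interaction between repetition, concatenation and alternation more concrete. It assumes Python's re module (PERL-style syntax); the patterns and strings are invented for illustration.

```python
import re

# Repetition binds tighter than concatenation: ab+ means a(b+), not (ab)+.
ungrouped_rejects = re.fullmatch(r"ab+", "abab") is None      # True
grouped_accepts = re.fullmatch(r"(ab)+", "abab") is not None  # True

# Concatenation binds tighter than alternation: gray|grey is (gray)|(grey).
alternation = [re.fullmatch(r"gray|grey", s) is not None
               for s in ("gray", "grey", "graygrey")]         # [True, True, False]
alt_grouped = re.fullmatch(r"gr(a|e)y", "grey") is not None   # True

# A repetition operator takes as many characters as it can.
longest = re.search(r"<.+>", "<b>bold</b>").group()           # '<b>bold</b>'
```

Parentheses are the usual way to override the default grouping, as the (ab)+ and gr(a|e)y patterns show.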

Differences Between Regular Expression Libraries

In Basic regular expressions, the metacharacters ?, +, (, ), {, }, and | lose their special meaning; they must first be prefixed with a backslash to work as described in this document. The metacharacters . and * keep their special meaning without a backslash.

In Extended regular expressions, and PERL regular expressions, the metacharacters above will lose their special meaning when prefixed with a backslash. This is the opposite of Basic regular expressions.
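The PERL-style escaping rule can be sketched with Python's re module (an assumption; Python has no Basic regular expression mode, so only the Extended/PERL behaviour is shown). The sample strings are invented.

```python
import re

price = "cost: 3.50 (approx)"  # hypothetical input

# Unescaped . matches any character; escaped \. matches only a literal period.
any_char = re.findall(r"3.5", "3.5 3x5")      # ['3.5', '3x5']
literal_dot = re.findall(r"3\.5", "3.5 3x5")  # ['3.5']

# Escaped parentheses match literal parentheses instead of grouping.
parens = re.search(r"\(\w+\)", price).group()  # '(approx)'

# re.escape adds the backslashes needed to match a string literally.
escaped = re.escape("1+1")
matches_literally = re.fullmatch(escaped, "1+1") is not None  # True
```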

In Shell expressions, * and ? are recognized, but they have different meanings. The ? matches any single character (like regex .), while the * matches any string of zero or more characters (like regex .*). Most shells also recognize bracket lists such as [abc].
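To compare shell expressions with regular expressions side by side, here is a minimal sketch that assumes Python's fnmatch module as a stand-in for shell pattern matching; real shells may differ in details such as how hidden files are handled. The file names are made up.

```python
import fnmatch
import re

names = ["a.txt", "ab.txt", "notes.txt", "a.text"]  # hypothetical file names

# Shell pattern: ? is any one character, * is any run of characters.
one_char = [n for n in names if fnmatch.fnmatchcase(n, "?.txt")]
any_run = [n for n in names if fnmatch.fnmatchcase(n, "a*.txt")]

# Rough regular expression equivalents: ? becomes . and * becomes .*
one_char_re = [n for n in names if re.fullmatch(r".\.txt", n)]
any_run_re = [n for n in names if re.fullmatch(r"a.*\.txt", n)]

# one_char and one_char_re -> ['a.txt']
# any_run and any_run_re   -> ['a.txt', 'ab.txt']
```

Note that the period is an ordinary character in the shell pattern but has to be escaped in the regular expression.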
