The regular expression dialect used in Perl, Python, and many other languages is an extension of basic regular expressions. Some of the major differences include:
Bare parentheses for capture—instead of using quoted parentheses, Perl uses parentheses by themselves. Quoted parentheses are treated as literals.
Addition of shortcuts for character classes. See “Character Class Shortcuts.”
Addition of quotation operators. In a regular expression, anything appearing between \Q and \E will be treated as literal text even if it contains characters that would ordinarily have special meaning in a regular expression. This is useful when user input, stored in a Perl variable, is used as part of a regular expression.
Addition of a "one-or-more" operator, represented by the plus (+) character.
Support for retrieving captured values outside the scope of the expression through the variables $1, $2, and so on. (See “Capture Operators and Variables” for information about capturing parts of a regular expression.)
Addition of non-greedy matching. See “Non-Greedy Wildcard Matching” for more information.
Non-capturing parentheses. See “Non-Capturing Parentheses” for more information.
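A few of the differences above can be sketched in Python, one of the languages the text names as using this dialect. This is an illustration under stated assumptions rather than a Perl example: Python's re module has no \Q and \E operators, so re.escape stands in for them here, and captured values are read from the match object with group() instead of from $1 and $2. The sample strings are made up for the example.

```python
import re

# Hypothetical user input containing characters that are special in a regex.
user_input = "1+1=2 (really)"

# re.escape plays the role of Perl's \Q...\E here: every special
# character in the input is treated as literal text.
pattern = re.compile(r"answer: " + re.escape(user_input))
literal_match = pattern.search("answer: 1+1=2 (really)")

# Bare parentheses capture; "+" means one or more of the preceding item.
date_match = re.search(r"(\d+)-(\d+)", "released 2008-04")
year, month = date_match.group(1), date_match.group(2)

print(literal_match is not None, year, month)
```

Escaping the input first matters because an unescaped "+" or "(" in user_input would otherwise be read as an operator instead of a literal character.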
Perl regular expressions add a number of character class shortcuts. Some of these are listed below:
\b—word boundary (see note).
\B—non-word boundary (see note).
\d—equivalent to [[:digit:]].
\D—equivalent to [^[:digit:]].
\f—form feed.
\n—newline.
\p—character matching a Unicode character property that follows. For example, \p{L} matches a Unicode letter.
\P—character not matching a Unicode property that follows. For example, \P{L} matches any Unicode character that is not a letter.
\r—carriage return.
\s—equivalent to [[:space:]].
\S—equivalent to [^[:space:]].
\t—tab.
\v—vertical tab.
\w—equivalent to [[:word:]].
\W—equivalent to [^[:word:]].
\x—start of an ASCII character code (in hex). For example, \x20 would be a space.
\X—a single Unicode character (not supported universally).
These can be used anywhere in the pattern (the left side of a substitution), and most of them can also be used within character classes.
Note: Word boundaries do not exist in basic regular expressions. They match the position between two characters rather than a character itself.
A word boundary occurs before the first character of a line (if it is a word character), at the end of the line (if it ends in a word character), and between any word character and non-word character that occur consecutively.
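The behavior of word boundaries and a couple of the shortcuts can be sketched as follows. This is a minimal example in Python's re module, which shares these particular shortcuts with Perl; the sample text is invented for the illustration.

```python
import re

text = "cat concat cats 42"

# \b matches the position between a word character and a non-word
# character (or the start/end of the string), not a character itself.
whole_words = re.findall(r"\bcat\b", text)

# Without the boundaries, "cat" is also found inside "concat" and "cats".
anywhere = re.findall(r"cat", text)

# \d matches a digit; shortcuts such as \d and \s also work inside
# a bracketed character class.
digits = re.findall(r"\d+", text)
class_use = re.findall(r"[\d\s]+", "a1 2b")

print(whole_words, len(anywhere), digits, class_use)
```

Because \b consumes no characters, the match for \bcat\b is still just the three letters "cat".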
By default, repeat operators are greedy, matching as many times as possible before attempting to match the next part of the string. This will generally result in the longest possible string that matches the expression as a whole. In some cases, you may want the matching to stop at the shortest possible string that matches the entire expression.
To support this, Perl regular expressions (along with many other dialects) support non-greedy wildcard matching. To convert a greedy wildcard to a non-greedy wildcard, add a question mark after it.
For example, consider the nursery rhyme "Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go." Assume that you apply the following expression:
/Mary.*lamb/
That expression would match "Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb".
Suppose that instead, you want to find the shortest possible string beginning with Mary and ending with lamb. You might instead use the following expression:
/Mary.*?lamb/
That expression would match only the words "Mary had a little lamb".
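The two expressions can be compared side by side. The sketch below uses Python's re module, which follows the same greedy and non-greedy rules as Perl for these patterns; the variable names are only illustrative.

```python
import re

rhyme = ("Mary had a little lamb, its fleece was white as snow, "
         "and everywhere that Mary went, the lamb was sure to go.")

# Greedy: .* takes as much as it can, so the match runs to the last "lamb".
greedy = re.search(r"Mary.*lamb", rhyme).group(0)

# Non-greedy: .*? takes as little as it can, so the match stops at the
# first "lamb" after the first "Mary".
lazy = re.search(r"Mary.*?lamb", rhyme).group(0)

print(greedy)
print(lazy)
```

Both searches begin at the first "Mary"; the question mark changes only how far the wildcard extends.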
You may notice that the syntax for capture is identical to the syntax for grouping described in “Wildcards and Repetition Operators.” In most cases, this is not a problem. However, in some cases, you may wish to avoid capturing content if you are using parentheses merely as a grouping tool.
To turn off capturing for a given set of parentheses, add a question mark followed by a colon after the open parenthesis.
Consider the following example:
# Expression (Perl and Similar ONLY): /Mary (?:had )*a little lamb\./
perl -e "while (\$line = <STDIN>) {
    \$line =~ s/Mary (?:had )*a little lamb\./Lovely day, isn't it?/;
    print \$line;
}" < poem.txt
This expression will match "Mary ", followed by zero (0) or more instances of "had ", followed by "a little lamb" and a literal period, and will replace the matched text with "Lovely day, isn't it?".
Note: Non-capturing parentheses are a Perl extension to regular expressions, and are not supported by most command-line tools.
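The effect on capture numbering can be sketched in Python, whose re module accepts the same (?: ... ) syntax. The input line and the extra (\w+) capture are assumptions added for the illustration; they are not part of the example above.

```python
import re

line = "Mary had had a little lamb."

# (?:had )* groups "had " for repetition without creating a capture,
# so the (\w+) group is the first and only captured value.
non_capturing = re.search(r"Mary (?:had )*a (\w+) lamb\.", line)

# With ordinary parentheses, the repeated group takes the first slot
# and pushes (\w+) to the second.
capturing = re.search(r"Mary (had )*a (\w+) lamb\.", line)

# The substitution from the example above, applied to the same line.
replaced = re.sub(r"Mary (?:had )*a little lamb\.",
                  "Lovely day, isn't it?", line)

print(non_capturing.groups(), capturing.groups(), replaced)
```

Keeping grouping-only parentheses out of the capture list means later references to captured values do not shift when a group is added purely for repetition.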
Last updated: 2008-04-08