Possessive Regular Expression Matching - Perl, Objective C and some other languages

Archive - Originally posted on "The Horse's Mouth" - 2012-03-12 15:45:23 - Graham Ellis

"I'm looking to spend between £200,000 and £225,000 on a new home," you say to the salesman and - guess what - you're offered something much nearer £225,000 than £200,000.

With Regular Expression matching, you can ask the question "do we have a match", and that returns a Yes / No flag - so it doesn't matter how it matches. But with regular expression matching to a string, you can also ask how it matches (i.e. "please return to me the bits of the incoming string which match each part of the pattern") and in that case the how does matter.
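As a minimal sketch of those two questions in Perl (the sample string and the variable names here are made up purely for illustration), the same match operator can be used either way:

```perl
use strict;
use warnings;

# Hypothetical sample string, used only for this illustration.
my $text = "price: 225000";

# The Yes / No question: in boolean context the match is simply true or false.
my $matched = ($text =~ /\d+/) ? "yes" : "no";

# The "how" question: in list context the match returns the captured groups.
my ($label, $digits) = $text =~ /(\w+):\s*(\d+)/;

print "Matched? $matched\n";
print "Label is $label, digits are $digits\n";
```

For the first question, any way of matching will do. For the second, the values that end up in $label and $digits depend on how the engine shares the string out between the counts.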

Let's see a series of examples, and I've chosen to use Perl. Here's a string:

$source = qq%Please "press" the enter key gently, don't "hit" enter!\n%;

Using a default count (that's a greedy match), counts such as + will match as many characters as possible, so:

  $source =~ /(\w+)\s*(\w+)/;

  print "We have $1 and $2\n";

will match as many letters as possible to the first \w+ - that's "Please". It will match the space to the \s*, but then fail at the double quote as it wants at least one word character. So it then steps backwards, matching just "Pleas" to the first \w+, nothing (no spaces) to the \s*, and the final e of the word Please to the second \w+. The result is
We have Pleas and e

Using a sparse count - that's +? with the extra "?", we match as few characters as possible in each count. So:

  $source =~ /(\w+?)\s*(\w+)/;

  print "We have $1 and $2\n";

will match as few letters as possible to the \w+? - that's just "P". It then finds no spaces, which satisfies the \s*, and it matches "lease" to the final \w+. The result is
We have P and lease

There's a third type of count too - a possessive count. It was added at release 5.10 of Perl, and it's also available in other regular expression engines, such as the one in Objective C. It's greedy as well, but with the difference that it will not step backwards to look for a shorter match once a longer one has been found for that particular count. To request a possessive match, add an extra + after the default count, so:

  $source =~ /(\w++)\s*(\w+)/;

  print "We have $1 and $2\n";

This will match "Please" to \w++ and the space to \s*, and will then fail as it tries to match the \w+ to the double quote. It will not hand back the final e in the way the default did, so the attempt starting at the P fails outright, and so do the attempts starting at each later letter of Please. The engine then tries \w++ on the word press. Once again, it fails at the next double quote and moves on rather than stepping back. Finally it matches \w++ to the word the, the space to \s*, and then (successfully!) the second \w+ to the word enter, giving the result
We have the and enter
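To compare the three counts side by side, here is a short sketch that runs each pattern against the same string. The labels and the %result hash are my own additions for this illustration and are not part of the original program.

```perl
use strict;
use warnings;

my $source = qq%Please "press" the enter key gently, don't "hit" enter!\n%;

# The three counts from the text, each paired with a label chosen for this sketch.
my @patterns = (
    [ greedy     => qr/(\w+)\s*(\w+)/  ],
    [ sparse     => qr/(\w+?)\s*(\w+)/ ],
    [ possessive => qr/(\w++)\s*(\w+)/ ],
);

my %result;
for my $entry (@patterns) {
    my ($name, $regex) = @$entry;
    if ($source =~ $regex) {
        # $1 and $2 hold whatever the two capture groups matched.
        $result{$name} = "$1 and $2";
        print "$name: We have $1 and $2\n";
    }
}
```

Only the count on the first group changes between the three patterns, yet each one captures a different pair of strings.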

You'll note that in this example, the possessive count results in a dramatically different match - though that won't always be the case. In fact, the documentation states that the main purpose of this new count is to allow the programmer to write regular expression matches that run faster as needless backtracking and matching attempts can be avoided.
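A small sketch of what "no stepping back" means when a match cannot succeed. The test string "aaaa" is a made-up example. The atomic group (?>...) form is included because the perlre documentation describes a possessive count as equivalent to it.

```perl
use strict;
use warnings;

# Hypothetical test string for this illustration.
my $word = "aaaa";

# Greedy: a+ first takes all four letters, then hands one back
# so that the final "a" in the pattern has something to match.
my $greedy = ($word =~ /^a+a$/) ? "match" : "no match";

# Possessive: a++ takes all four letters and will not hand any back,
# so the final "a" can never match and the whole attempt fails at once.
my $possessive = ($word =~ /^a++a$/) ? "match" : "no match";

# Atomic group: the longer way of writing the same thing.
my $atomic = ($word =~ /^(?>a+)a$/) ? "match" : "no match";

print "greedy: $greedy\n";
print "possessive: $possessive\n";
print "atomic: $atomic\n";
```

The greedy pattern matches, while the possessive and atomic forms do not. That gives up a possible match, but it also means the engine never retries shorter and shorter runs of "a", and that skipped work is where the run time saving comes from.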

Possessive regular expressions won't get more than a brief mention on most of our courses, although we'll talk about them (and perhaps show a demonstration) on Perl for Larger Projects if delegates have run time concerns. We will also cover them on our Regular Expression Course.

The examples that I've used above are shown as a complete program on our web site - [here].
