Main Content

Matching to a string - what if it matches in many possible ways?

Archive - Originally posted on "The Horse's Mouth" - 2010-12-17 10:04:21 - Graham Ellis

If you're looking to match and capture part of a string that matches your pattern, you have to be very careful to ensure that you match the correct part of the incoming string. If - for example - I were to ask you what 3 digit numbers the text "I live at 404 and my phone number is 708225" contains, you would naturally reply "404", but there are more three digit numbers there - 404, 708, 082, 822 and 225. And if I were to ask you for all numbers of 3 or more digits, you would add 7082 0822 8225 70822 08225 and 708225 to that list. This is unimportant when you ask "does this string contain some numbers of 3 or more digits", but it is critically important when you then say "what are those numbers because I want to process them".

So what are the rules for making selections in regular expressions? They default to:
• Leftmost starting point and then
• Longest possible match
and then if global / multiple matches are found, it will look for non-overlapping ones.

So ... looking in
  I live at 404 and my phone number is 708225
for a number of three or more digits, a regular expression system would find
  404
and if it was told to do a global match it would find
  404 and 708225

These defaults are sensible - they provide what you want in 95% of cases, and they apply to the following counts:
  * (zero or more)
  ? (zero or one)
  + (one or more)
  {3,6} (from 3 to 6)
  (3,} (3 or more)
and this is known as "greedy matching" because it will grab as many characters as possible.

But there are a few occasions where greedy matching is not what you want. Look at this XML:
  <grand>Aeryn</grand><grand>Zyliana</grand>
with a greedy match to <(.*)> , you'll get a single capture:
  grand>Aeryn</grand><grand>Zyliana</grand
whereas you're likely to be wanting to match each of the tag elements:
  grand, /grand, grand and /grand

You can achieve this though sparse matching, adding an extra ? after the count - thus:
  * (zero or more, as few as possible)
  ? (zero or one, preferably one rather that zero)
  + (one or more, as few as possible)
  {3,6} (from 3 to 6, as few as possible)
  (3,} (3 or more, as few as possible)

That's "sparse" v "greedy" but there's one further element to consider - the .* that I'be used in the regular expression examples, and what it actually means ...

You'll often find you're told that the dot (".") matches any character, but that's not quite true by default; the dot usually matches any character except a new line. That exception is written into the default regular expression engines to stop the sequence ".*" running on from one line to another within a multiline string, and that's a good default, but not always what you want to do. In Perl / PHP [preg style], you can use an "s" modifier - it stands for single line mode - to specify that you truely want the dot to match absolutely any character. In Python, you'll normall do it by adding the re.DOTALL flag onto the end of your regular expression compile.

The principles in this article apply across almost all of the regular expression implementations; I've chosen to add an example in Python on to our web site to illustrate them - source code and sample output is [here]