Efficient debugging of regular expressions

Archive - Originally posted on "The Horse's Mouth" - 2010-01-04 08:55:00 - Graham Ellis

Are you a programmer? Have you ever spent *hours* looking for a very odd bug in your code that, when you find and fix it, turns out to have been so blindingly obvious that you ask "why did it take me so LONG" ... and so does the boss! The good news is that - if the boss is a programmer - (s)he will have been there too and will at least have some understanding and sympathy.

Our Hotel Booking Script has been working well for a very long time - "robust", in fact, but when I entered a booking that came in by phone over the weekend, it went all through the process but then made an error on the confirmation email, which was generated for 2000 not 2010. Some rapid testing showed me that if I didn't enter the year on a booking, there was no problem - but if I specified 2010 it changed it. Odd one, indeed - but one that needed to be fixed.

Have a look at this regular expression, which I used on a trimmed date entry to look for a day, month, year format:
  preg_match('!^(\d+)[-/:\. ]+(\d+)[/-: ]+(\d+)$!', $input, $gotten )
It's been working fine - setting up the day in $gotten[1], the month in $gotten[2] and the year in $gotten[3] for years. To allow for year entries of the "09", "9" or "2009" format, we've taken the modulo base 100 and passed that into mktime to make a timestamp.
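
As a minimal sketch of that year handling (the booking script's own code isn't shown here, so the function name booking_timestamp is hypothetical), the following relies on mktime's documented behaviour of reading a year of 0 to 69 as 2000 to 2069:

```php
<?php
// Hypothetical helper, not the original booking script: turn the
// captured day / month / year strings into a timestamp.
function booking_timestamp(string $day, string $month, string $year): int
{
    // "9", "09" and "2009" all reduce to 9 here.
    $shortYear = (int)$year % 100;

    // mktime() maps a two-digit year of 0-69 to 2000-2069.
    return mktime(0, 0, 0, (int)$month, (int)$day, $shortYear);
}

// Each spelling of the year should land in the intended year.
foreach (['9', '09', '2009', '10', '2010'] as $year) {
    echo $year, ' => ', date('Y-m-d', booking_timestamp('4', '1', $year)), "\n";
}
```

So long as the regular expression hands over the year the user typed, every one of those spellings ends up in the right year.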

Can you spot the error that causes this to fail for "10" or "2010"? It took me longer than it should have (and that's once I had tied the problem down to that particular line!)

The code:
  preg_match('!^(\d+)[-/:\. ]+(\d+)[/-: ]+(\d+)$!', $input, $gotten )
should have been:
  preg_match('!^(\d+)[-/:\. ]+(\d+)[-/: ]+(\d+)$!', $input, $gotten )

The regular expression element [/-: ] was SUPPOSED to match any one of the four characters in the square brackets, but the minus sign within a list of characters is a special case that gives a range of characters ... / is ASCII code 47, as I recall, and : is code 58 - so I'm matching any one of / 0 1 2 3 4 5 6 7 8 9 and : (plus the space) ... or, in the context in which I've used the match, I'm absorbing the entire year as if it were part of the separator between the month and year, then backtracking to get the last digit (only) of the year into $gotten[3]. Simples!
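
One quick way to check what a bracket expression really accepts is to try it against every printable ASCII character in turn. This is only a sketch, and the helper name class_members is made up for the illustration:

```php
<?php
// Hypothetical helper: return every printable ASCII character
// that the given bracket expression matches.
function class_members(string $class): string
{
    $members = '';
    for ($code = 32; $code < 127; $code++) {
        // Anchor the class so it has to match the single character exactly.
        if (preg_match('!^' . $class . '$!', chr($code))) {
            $members .= chr($code);
        }
    }
    return $members;
}

echo 'buggy [/-: ] matches: "', class_members('[/-: ]'), '"', "\n";
echo 'fixed [-/: ] matches: "', class_members('[-/: ]'), '"', "\n";
```

The buggy class picks up all ten digits and no longer includes the hyphen itself. A hyphen is taken literally when it's the first or last character in the brackets, or when it's escaped with a backslash.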

Net result ... my code as it stood worked for the first ten years of the century, 2000 to 2009, when the one digit left in $gotten[3] happened to give the right year ...
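
A short sweep over a run of years shows the effect. It's a sketch under assumptions: decode_year is a hypothetical stand-in for the booking script's year handling (modulo 100, read as 20xx in the way mktime treats two-digit years from 0 to 69), and the dates tried are slash-separated only.

```php
<?php
$buggy = '!^(\d+)[-/:\. ]+(\d+)[/-: ]+(\d+)$!';
$fixed = '!^(\d+)[-/:\. ]+(\d+)[-/: ]+(\d+)$!';

// Hypothetical stand-in for the script's year handling.
function decode_year(string $pattern, string $input): int
{
    if (!preg_match($pattern, $input, $gotten)) {
        return -1; // the input did not match at all
    }
    return 2000 + ((int)$gotten[3] % 100);
}

$survivors = [];
for ($year = 2000; $year <= 2012; $year++) {
    $input = "4/1/$year";
    $old = decode_year($buggy, $input);
    $new = decode_year($fixed, $input);
    printf("%s  buggy: %d  fixed: %d\n", $input, $old, $new);
    if ($old === $year) {
        $survivors[] = $year; // buggy pattern still gave the right year
    }
}
echo 'Buggy pattern correct for: ', implode(' ', $survivors), "\n";
```

A loop like this over a spread of realistic inputs is cheap to write, and it would have flagged the problem before the first 2010 booking came in.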

Regular expressions are extremely powerful - the example that I've used above copes with more or less any conventional date made up of three numeric elements, irrespective of the separators the user chooses (and that's exactly what you want on a web site). However, you need to understand the power and the elements behind them, as I have demonstrated today, otherwise you can get caught as I did.

We run a Regular Expression Course - it's one day - for delegates who are already familiar with a language that supports regular expressions ... that's Perl, PHP, Python, Java, Tcl, Ruby ... on which we cover topics such as greedy and sparse matching, look ahead, grouping and much more. We won't prevent you writing code that very occasionally has problems like the one I described above, but we'll help you reduce those problems to a bare minimum, and allow you to quickly find and fix the few issues that you do have.

I didn't have a big explanation to make to anyone as to why I took so long to fix the problem above - from starting to look, it took less than an hour to tie it down to being a problem with that regular expression, and then to spot what the problem was.