Python regular expressions - repeating, splitting, lookahead and lookbehind
Archive - Originally posted on "The Horse's Mouth" - 2010-12-17 06:56:52 - Graham Ellis
If you're looking for part of a string that's repeated again later in the string, you can capture the first occurrence and then use a back reference (\1, \2 etc) to refer to "same again". In Python, you can also name the group that you want to repeat, and refer back to it by that name - examples [here].
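By way of illustration, here's a minimal sketch of both styles of back reference. The sample strings and variable names are my own inventions for this note rather than anything from the linked example.

```python
import re

# Numbered back reference: \1 means "the same text that group 1 captured".
doubled = re.compile(r"\b(\w+) \1\b")
match = doubled.search("Paris in the the spring")
repeated_word = match.group(1)
print(repeated_word)            # the

# Named group: (?P<tag>...) captures, (?P=tag) means "same again".
tagged = re.compile(r"<(?P<tag>\w+)>.*?</(?P=tag)>")
tag_match = tagged.search("some <b>bold</b> text")
tag_name = tag_match.group("tag")
print(tag_name)                 # b
```

The named form reads better once you have more than two or three groups, as you don't have to count brackets to work out which number you need.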
If you have a number of fields on a line, rather than look to identify each field with a match, you'll often find it easier to match the separator - and you can do this in many languages with a function or method called split. In Python, there are two different split methods - one is a method on a string object and splits at an exact (literal) string, and the other is a method on a regular expression object, and that one splits at a pattern. Beware - the calling sequences are different between the two splits - the string method takes the separator as its parameter, whereas the regular expression method takes the string to be split - so they are not polymorphic. See the example [here].
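A short sketch of the two splits side by side may help. The sample data is made up for illustration, and I'm assuming separators of commas or semicolons with optional spaces around them.

```python
import re

# String method: called on the string, parameter is a literal separator.
literal_parts = "a,b,c".split(",")
print(literal_parts)            # ['a', 'b', 'c']

# Regular expression method: called on the compiled pattern,
# parameter is the string to be split.
line = "apple, banana;cherry ,  date"
separator = re.compile(r"\s*[,;]\s*")
pattern_parts = separator.split(line)
print(pattern_parts)            # ['apple', 'banana', 'cherry', 'date']
```

Note which way round each call goes - that's the difference in calling sequence that catches people out.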
Regular expressions can easily become very long and complex and have sections that repeat themselves ... so you should remember that if you find yourself repeating something, there has to be an easier way! In the case of regular expressions, you can often build up your regular expression as a string from a number of elements (which you can reuse), meaning that only the component elements actually appear in your source. If you want to see what I mean, there's a source code example [here].
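As a rough sketch of the idea (and not the linked example itself), here's a dotted-quad IP address pattern built from a single reusable element. The names octet and ipv4 are hypothetical, chosen just for this illustration.

```python
import re

# One element, written once: a number from 0 to 255.
octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Reuse it four times, joined by literal dots, rather than typing it out.
ipv4 = re.compile(r"\.".join([octet] * 4))

good = ipv4.fullmatch("192.168.0.1") is not None
bad = ipv4.fullmatch("256.1.1.1") is not None
print(good, bad)                # True False
```

If the definition of an octet ever needs changing, there's only the one place to change it.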
In a regular expression, you match from left to right, and each time you specify an individual character or a character from a group, you move on along the incoming string. Occasionally - VERY occasionally - you want to say "is this followed by" but NOT move on, giving you the opportunity to match the same part of the incoming string against two different patterns, and continue on only if it matches both of them. This is known as positive lookahead. You may also want to do the same thing but continue on only if the upcoming text fails to match a pattern - this is known as negative lookahead and turns out to be more useful than positive lookahead. I've added a source code example onto our site for negative lookahead - it's [here] - where we're looking for town names that end in "ing?on", but we're using negative lookahead to exclude specifically "ington".
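Here's a minimal sketch of both kinds of lookahead. It is not the example from our site - I've assumed a simpler rule (names ending in "ton" but not in "ington") and made up the list of towns and the code-word check purely for illustration.

```python
import re

towns = ["Swindon", "Taunton", "Wellington", "Brighton", "Paddington", "Bath"]

# Negative lookahead: (?!...) checks what follows WITHOUT moving on.
# From the start of the name, insist that it does not end in "ington",
# then go on to match a name that ends in "ton".
ton_not_ington = re.compile(r"^(?!.*ington$).*ton$")
wanted = [town for town in towns if ton_not_ington.search(town)]
print(wanted)                   # ['Taunton', 'Brighton']

# Positive lookahead: (?=...) tests the same text against two patterns.
# Must contain a digit AND a lower case letter, at least 6 characters.
code_word = re.compile(r"^(?=\w*[0-9])(?=\w*[a-z])\w{6,}$")
checks = [bool(code_word.search(s)) for s in ("abc123", "abcdef", "123456")]
print(checks)                   # [True, False, False]
```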
As well as lookahead, many regular expression handlers offer lookbehind, and I've added a negative lookbehind example [here]. Again - you'll only find occasional good uses for lookbehind.
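For a flavour of negative lookbehind, here's a small sketch that picks out the numbers in a string that are not prices, i.e. not preceded by a dollar sign. The sample text is invented for this note. Bear in mind that in Python's re module a lookbehind pattern has to be of fixed width.

```python
import re

text = "costs $40 for 12 items, or 3 for $100"

# (?<!\$) says "not preceded by a dollar sign" without consuming anything.
# The \b stops the match starting part way through a number such as 40.
not_a_price = re.compile(r"(?<!\$)\b[0-9]+\b")
quantities = not_a_price.findall(text)
print(quantities)               # ['12', '3']
```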
We provide some coverage of Python regular expressions on our regular public Python courses. More advanced / specialized topics such as lookahead are covered on our Regular Expressions day. Note that if you're on one of our main Python courses and would like an introduction to some of the more advanced features, I can easily be persuaded to take you through some of them after the course finishes one day so that you don't need to come back for the "Regex Special" ...