Looking ahead and behind in a Regular Expression
Archive - Originally posted on "The Horse's Mouth" - 2006-05-22 05:48:17 - Graham Ellis
Regular expressions in Perl and PHP include facilities called zero width assertions, zero width lookaheads and zero width lookbehinds. A case of jargon that looks almost calculated to confuse?
A zero width assertion is where a regular expression matches some sort of condition at a position in the string, without actually consuming any characters from the incoming string - the three most common examples are ^ (must be at start of string), $ (must be at end of string) and \b (must be at word boundary).
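As a small illustration (a sketch using a made-up sample string, not something from the original post), the following PHP shows that \b matches a position rather than any characters, and that ^ and $ simply pin a pattern to the start or end of the string:

```php
<?php
// Hypothetical sample text, chosen only for illustration.
$text = "cat concat cat";

// \b on its own matches a position: the matched text is empty.
preg_match('/\b/', $text, $m, PREG_OFFSET_CAPTURE);
$matchedText = $m[0][0];   // the characters consumed by the match
$matchedAt   = $m[0][1];   // the offset at which the match occurred

// \bcat\b finds "cat" only as a whole word, so "concat" is skipped.
$wholeWords = preg_match_all('/\bcat\b/', $text);

// ^ and $ anchor the pattern to the start and end of the string.
$startsWithCat = preg_match('/^cat/', $text);
$endsWithCat   = preg_match('/cat$/', $text);

var_dump($matchedText, $matchedAt, $wholeWords, $startsWithCat, $endsWithCat);
```

The \b match comes back as an empty string at offset 0, which is what "zero width" means in practice: the assertion succeeded, but nothing was consumed.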
There are times when you may wish to say "if followed by", "if not followed by", "if following" or "if NOT following" in a regular expression match, without actually moving backward or forward over the incoming string. For example, in a spell checker I was writing yesterday, I wanted to split my incoming string at each word boundary, but only if that boundary was NOT following or followed by a single quote. And, crucially, the single quote character was not to be included in the matched string itself - I was just saying "no break here" in the case of words like hasn't and I'll. This calls for a zero width negative lookbehind, written (?<!'), and a zero width negative lookahead, written (?!').
Here's the complete regular expression of my example:
$elements = preg_split("/\b(?<!')(?!')/",$page);
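To see what the two lookarounds change, here is a sketch comparing a plain \b split with the expression above. The sample sentence and the PREG_SPLIT_NO_EMPTY flag are assumptions added for illustration and are not part of the original spell checker. The flag only drops the empty strings that a zero width match at the very start or end of the string would otherwise produce.

```php
<?php
// Hypothetical sample input; the real spell checker reads a whole page.
$page = "I'll say it hasn't rained";

// Splitting at every word boundary breaks contractions apart.
$plain = preg_split("/\b/", $page, -1, PREG_SPLIT_NO_EMPTY);

// Same split, but a boundary is rejected if a single quote sits
// immediately before it (?<!') or immediately after it (?!').
$elements = preg_split("/\b(?<!')(?!')/", $page, -1, PREG_SPLIT_NO_EMPTY);

print_r($plain);
print_r($elements);
```

With the plain split, hasn't comes back as three pieces (hasn, ' and t); with the lookarounds in place, I'll and hasn't each stay whole, and the quote is still part of the word because neither assertion consumed it.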
Footnote - Zero width positive lookaheads are written (?=xx) and zero width positive lookbehinds are written (?<=xx), where xx is the expression that must match ahead of or behind the current position.
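The positive forms can be sketched in the same way. The sample text and patterns below are hypothetical, picked only to show that the looked-for text has to be present but is not included in the match:

```php
<?php
// Hypothetical sample text, chosen only for illustration.
$text = 'Price: $15, weight 30kg';

// Positive lookahead: digits only if followed by "kg"; "kg" is not captured.
preg_match_all('/\d+(?=kg)/', $text, $ahead);

// Positive lookbehind: digits only if preceded by "$"; "$" is not captured.
preg_match_all('/(?<=\$)\d+/', $text, $behind);

print_r($ahead[0]);
print_r($behind[0]);
```

The lookahead picks out 30 and the lookbehind picks out 15, in each case without the kg or the $ that qualified the match.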