Learning about Regular Expressions in C through examples
Archive - Originally posted on "The Horse's Mouth" - 2010-06-30 19:01:12 - Graham Ellis
Although we more usually teach Regular Expressions on courses on Perl, Python, PHP, Ruby, etc., there is also a regular expression library available from C. It uses the POSIX flavour of regular expressions (and is part of POSIX rather than of the ISO C standard library itself), and I've put a short example together to "show you how".
Firstly - what is a regular expression?
It's a "pattern match" - you use it to say "does this look like that?" - not checking for equality, but rather checking to see if something conforms to a pattern. The regular expression is how you fully define that pattern.
So - for example - you could write a regular expression like
^[0-9]{5}$
which means:
• starts with
• a digit
• (five of those)
• and ends
and would let you match the format for an American Zip code.
To use the POSIX regular expression functions from C, you include regex.h:
#include <regex.h>
You then need to "compile" the regular expression:
regcomp(&emma,reginald,REG_EXTENDED|REG_NOSUB);
and you can see if another string matches it:
status = regexec(&emma,millie,(size_t)0,NULL,0);
The returned status is 0 for "yes, that matched" and REG_NOMATCH (a non-zero value, which prints as 1 in the sample output below) for "no, that did not match".
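To show those two calls working together, here's a minimal sketch. The helper name matches and its 1 / 0 / -1 return convention are my own for illustration and are not taken from the sample program; the regcomp, regexec and regfree calls are the standard POSIX ones.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the sample program): returns 1 if
   `text` matches the extended regular expression `pattern`, 0 if it
   does not, and -1 if the pattern itself fails to compile. */
int matches(const char *pattern, const char *text)
{
    regex_t compiled;

    /* REG_EXTENDED: POSIX extended syntax, so {5} needs no backslashes.
       REG_NOSUB: we only want a yes/no answer, not match positions. */
    if (regcomp(&compiled, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* With REG_NOSUB the nmatch and pmatch arguments are ignored. */
    int status = regexec(&compiled, text, (size_t)0, NULL, 0);

    regfree(&compiled);   /* release memory allocated by regcomp */
    return status == 0 ? 1 : 0;
}
```

Calling matches("^[0-9]{5}$", "77663") should give 1, and the same pattern against "987662" should give 0, since that string has six digits rather than five.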
There's a complete sample program (showing the context of all the various variables in the lines above) - [here]. It reads a regular expression that the user types in (not usually a good idea, as most users don't understand regular expressions!), then it reads a whole series of further lines and tells you whether each one matches. Here's some sample output:
Please give test expression: ^[0-9]{5}$
Validity of regex (0 => OK): 0
Please give test string: 77663
Matched (0 => Yes): 0
Please give test string: 987662
Matched (0 => Yes): 1
Please give test string: 55332
Matched (0 => Yes): 0
Please give test string:
wizzard:c graham$
As you can see, it's very useful indeed - and rather clever.
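Since the sample program compiles whatever the user types, the "Validity of regex" number is worth turning into something readable when it isn't 0. Here's a rough sketch using the POSIX regerror function; the helper name compile_or_explain and its parameters are assumptions of mine, not part of the sample program.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` into `compiled`. Returns the
   regcomp status (0 => OK). On failure, a human-readable explanation
   is written into `message`; on success, `message` is set to "". */
int compile_or_explain(regex_t *compiled, const char *pattern,
                       char *message, size_t message_size)
{
    int rc = regcomp(compiled, pattern, REG_EXTENDED | REG_NOSUB);

    if (rc != 0) {
        /* regerror turns the numeric code into text, truncating to fit */
        regerror(rc, compiled, message, message_size);
    } else if (message_size > 0) {
        message[0] = '\0';
    }
    return rc;
}
```

The caller should only call regfree on the regex_t when the return value is 0. The exact wording of the message depends on the C library you're running on.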
In the case of a USA zip code, I could simply use the atoi function if the regexec function reported a match, but it would be rather trickier with a UK postcode ... or indeed if the zip code was embedded within a full line of text. So in many circumstances you want not only to ask "did it match?", but also to ask "which part of the incoming string matched which part of the regular expression?". With different parameters (leaving REG_NOSUB out of the regcomp call and passing an array to regexec), regexec can fill in an array of regmatch_t structures, each holding the start and end offsets of a match, so that you can get at this information.
There's a complete source code example of this "match and capture" - [here] - and it's got further comments in it to help you follow how it works. Running this program on a UK postcode, our output included the following:
wizzard:c graham$ ./reg2
Please give test expression: ([A-Z]{1,2})[0-9][0-9A-Z]? +[0-9][A-Z]{2}
Validity of regex (0 => OK): 0
Please give test string: We are at SN12 7NY for this course
Matched (0 => Yes): 0
From 10 to 18 (SN12 7NY)
From 10 to 12 (SN)
Please give test string:
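A cut-down sketch of that "match and capture" idea might look like the following. The names find_groups, print_groups and MAX_GROUPS are made up for this illustration and are not those used in the linked example; regmatch_t with its rm_so and rm_eo members is the standard POSIX structure.

```c
#include <regex.h>
#include <stdio.h>

#define MAX_GROUPS 4   /* assumed limit: whole match plus up to 3 groups */

/* Hypothetical helper: search `text` for `pattern` and fill `groups`
   with the offsets of the whole match [0] and each bracketed group.
   Returns 0 on a match, REG_NOMATCH on no match, -1 on a bad pattern. */
int find_groups(const char *pattern, const char *text,
                regmatch_t groups[MAX_GROUPS])
{
    regex_t compiled;

    /* No REG_NOSUB this time, because we want the positions back. */
    if (regcomp(&compiled, pattern, REG_EXTENDED) != 0)
        return -1;

    int status = regexec(&compiled, text, MAX_GROUPS, groups, 0);
    regfree(&compiled);
    return status;
}

/* Print each group that took part in the match. */
void print_groups(const char *text, const regmatch_t groups[MAX_GROUPS])
{
    for (size_t i = 0; i < MAX_GROUPS; i++) {
        if (groups[i].rm_so == -1)
            continue;   /* unused entries are set to -1 */
        int start = (int)groups[i].rm_so;
        int end = (int)groups[i].rm_eo;   /* one past the last character */
        printf("From %d to %d (%.*s)\n", start, end, end - start, text + start);
    }
}
```

Note that rm_eo is the offset of the character after the match, which is why "SN" at positions 10 and 11 is reported as "From 10 to 12".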
If you're looking for further information about the elements within regular expressions, have a look at our regular expression pages - [here]. C uses the "POSIX Style", which is similar to what you'll find in Tcl and Expect ... and remember that we cover regular expressions on a special regular expression course that we run from time to time, as well as where appropriate on language courses. This week - thus the blog - it's during a C and C++ course.