
Perl Regular Expressions - finding the position and length of the match

Archive - Originally posted on "The Horse's Mouth" - 2006-02-02 04:30:52 - Graham Ellis

If you want to find the position of a match in an incoming string, check the length of $` (that's $PREMATCH if you've chosen to "use English;") to find where the match starts, and add the length of $& (that's $MATCH) to find where it ends.
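As a minimal sketch of that arithmetic, here it is on a made-up string (the sample text and variable names are purely illustrative, not from the original page):

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $text = "See http://example.com/a for details";

my ($start, $length, $end);
if ($text =~ m!https?://[^ >"]+!) {
    $start  = length($`);          # characters before the match
    $length = length($&);          # characters in the match itself
    $end    = $start + $length;    # offset just past the match
    print "$start, $length, $end\n";
}
```

The start offset counts from zero, so substr($text, $start, $length) gives back the matched text.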

Let's say I want to find all the URLs referred to in a web page that's loaded into the variable $html. I could write:


push @section,[length($`),length($&),$&]
while ($html =~ m!(https?://[^ >"]+)!g);


and that will give me a list of 3-element lists containing start point, length and actual string matched. Here's the code to display that list:


foreach my $element (@section) {
    print join(", ", @$element), "\n";
}


and here are some of the results from the source of our resources index:


5979, 36, http://www.wellho.net/forum/top.html
6967, 36, http://www.wellho.net/net/mouth.html
7059, 42, http://www.wellho.net/downloads/index.html
8369, 67, http://www.wellho.net/mouth/387_Training-course-plans-for-2006.html
9365, 43, http://www.trainingcenter.co.uk/travel.html
9516, 45, http://reiseauskunft.bahn.de/bin/query.exe/en
9599, 59, http://www.livedepartureboards.co.uk/ldb/summary.aspx?T=MKM
9861, 48, https://lightning.he.net/~wellho/net/secure.html
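Perl also records match offsets in the built-in arrays @- and @+, so the same three-element list can be built without $` and $&. This is an alternative sketch rather than the method used above, and the sample HTML is invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample page, standing in for the real $html.
my $html = '<a href="http://example.com/one">1</a> <a href="https://example.org/two">2</a>';

my @section;
while ($html =~ m!https?://[^ >"]+!g) {
    # $-[0] is the start offset of the whole match; $+[0] is the offset just past it.
    my $start  = $-[0];
    my $length = $+[0] - $-[0];
    push @section, [ $start, $length, substr($html, $start, $length) ];
}

# Same display loop as before: start, length, matched text.
print join(", ", @$_), "\n" for @section;
```

On Perl versions before 5.20, mentioning $` or $& anywhere in a program slows down every regular expression match in it, which is one reason to prefer the offset arrays on large inputs.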


P.S. I loaded my whole web page into a single variable using the code

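open (FH,"/Library/WebServer/live_html/resources/index.html")
    or die "Cannot open file: $!";
undef $/;      # no input record separator: the next read takes everything
$html = <FH>;
close FH;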

which is a nice little demo of changing (or removing) the input record separator for reading from a file handle, via the $/ variable. Once $/ has been undef-fed, reading into a scalar slurps from the current position in the file right through to the end of file.
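A plain undef $/ stays in force for the rest of the program, so later line-by-line reads would also slurp. A common variation confines the change with local. Here is a minimal sketch, where slurp_file is a hypothetical helper name and not part of the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one scalar.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;              # undefined only inside this sub; restored on return
    my $content = <$fh>;   # one read returns everything up to end of file
    close($fh);
    return $content;
}
```

Calling my $html = slurp_file($path); then leaves $/ at its usual newline value for any reads that follow.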