Using English can slow you right down!

Archive - Originally posted on "The Horse's Mouth" - 2007-11-25 08:06:01 - Graham Ellis

When you're programming and you assign to a variable, you're usually storing a single result. For example (we're talking Perl in these examples):
$position = $record % $pagelength;

However, there are often side effects of an action which you might want to make later use of too. For example:
$postcode = ($pc =~ /^[A-Z]{1,2}\d[A-Z0-9]?\s/);
will return a true or false result into $postcode, indicating whether the string in $pc starts in the correct format for the first part of a UK postcode ... but you might also want to know what the actual string that matched was, and what was left of the incoming string that's not matched by the regular expression. Perl makes this easy for you: if you know what they're called, you can refer to other special variables in your program that contain these resultant side effects.

These "side effects" are a feature of real life too. If I tidy up my office it results in a neat desk. And the side effects include a full rubbish bin and also a box for recycling!

Three of the special variables used in regular expression matching are $`, $& and $' (yes, not the normal rules for variable names). They provide the bit before the match, the bit that matched, and the bit after the match respectively. There are a whole host of others too, both on regular expression matching and on many other aspects of what happens when Perl runs: you have $. telling you how many lines you have read since you opened a file, $$ telling you your process ID, $^O for your operating system's name, and so on.
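As a minimal sketch of those three variables in action, here is the postcode pattern (with the character class written as A-Z) applied to a made-up sample string; the variable names $before, $matched and $after are just illustrative choices.

```perl
use strict;
use warnings;

# Sample input, invented for illustration
my $pc = "SN12 7NY Melksham";

my ($before, $matched, $after);
if ($pc =~ /^[A-Z]{1,2}\d[A-Z0-9]?\s/) {
    # $` is the text before the match, $& the match itself,
    # $' the text after the match
    ($before, $matched, $after) = ($`, $&, $');
}

print "before=[$before] matched=[$matched] after=[$after]\n";
```

Because the pattern is anchored at the start of the string, the "before" part is empty here, and the "after" part is whatever follows the outward half of the postcode.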

You can get mighty confused as to what's what if you're new to Perl so there's an additional module supplied with the standard distribution called English and you can call it in with
use English;
giving you extra variable names such as $MATCH, $POSTMATCH and $OSNAME, which make your code much more readable.
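A short sketch of what the longer names look like in use; the sample text is invented, and $PREMATCH and $PROCESS_ID are further English names (for $` and $$) alongside the ones mentioned above.

```perl
use strict;
use warnings;
use English;

# Sample input, invented for illustration
my $text = "call 01225 now";

my ($pre, $match, $post);
if ($text =~ /\d+/) {
    # Readable aliases for $`, $& and $'
    ($pre, $match, $post) = ($PREMATCH, $MATCH, $POSTMATCH);
}

print "PREMATCH=[$pre] MATCH=[$match] POSTMATCH=[$post]\n";
print "Running on $OSNAME as process $PROCESS_ID\n";
```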

There is, though, a sting in the tail. If your program uses $' $` or $& anywhere in the code, then Perl has to save the extra variables for every regular expression match it does, even those which are nowhere near the use of those special variables. That means that if you are matching repeatedly against a very long string, using them can have a serious effect on performance. The English module provides aliases to $` $' and $&, and in doing so it makes reference to them, so any Perl program that includes a use English; saves them at every match, with this potentially huge loss of efficiency at run time.
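One way round this, as far as I know from the English module's own documentation, is to ask it not to alias the three match variables by passing -no_match_vars. The same documentation notes that from Perl 5.20 onwards mentioning these variables no longer carries the speed penalty, so this mostly matters on older Perls. A minimal sketch, with an invented sample line:

```perl
use strict;
use warnings;
# Import the readable names, but skip the aliases for $`, $& and $'
use English qw(-no_match_vars);

# Sample input, invented for illustration
my $line = "id=42";

my $value;
if ($line =~ /=(\d+)/) {
    # Capture groups such as $1 do not trigger the match-variable cost
    $value = $1;
}

print "value=$value on $OSNAME\n";
```

The other English names, such as $OSNAME, remain available; only $PREMATCH, $MATCH and $POSTMATCH are left out.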

It's a bit of an "old wives' tale" that English slows you down, so I've written a very short benchmark to show what happens:

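# Build a very long test string (about 100 million characters)
$sample = "abcdefghij" x 1000;
$longer = ($sample . "12345") x 1000;
$evenlonger = ($longer . "zzz") x 10;
print (length ($longer),"\n");

# times() returns user, system, child user and child system CPU seconds
@taken = times();
print ("@taken\n");

# Match 100 times against the long string, reporting every 25th match
for ($k=0; $k<100; $k++) {
   if ($evenlonger =~ /5z/) { $lc++;}
   print "$lc\n" unless ($lc % 25);
}
@taken = times();
print ("@taken\n");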


And that took about 7 seconds of CPU time (user plus system):

earth-wind-and-fire:~/nov07 grahamellis$ perl tidemo
10005000
0.88 0.7 0 0
25
50
75
100
6.54 0.71 0 0
earth-wind-and-fire:~/nov07 grahamellis$


Adding a single extra use English; at the top, without making any other changes (not even using the new names in my own code), I got:

earth-wind-and-fire:~/nov07 grahamellis$ perl tidemo
10005000
0.89 0.73 0 0
25
50
75
100
25.8 28.2 0 0
earth-wind-and-fire:~/nov07 grahamellis$


So that's around 54 seconds of CPU time (25.8 user plus 28.2 system) rather than the previous 7.
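The totals quoted are the user and system figures from the final times() line added together, so they include the time spent building the strings. If you want to time just the matching loop, a sketch along these lines would do it; the helper name cpu_seconds and the much shorter test string are my own choices for illustration, not part of the benchmark above.

```perl
use strict;
use warnings;

# Return user + system CPU seconds used by this process so far
sub cpu_seconds {
    my ($user, $system) = times();
    return $user + $system;
}

# A much smaller string than the benchmark's, to keep the sketch quick
my $text = ("abcdefghij" x 1000) . "5z";

my $start = cpu_seconds();
my $hits  = 0;
for my $k (1 .. 100) {
    $hits++ if $text =~ /5z/;
}
my $elapsed = cpu_seconds() - $start;

printf "%d matches in %.2f CPU seconds\n", $hits, $elapsed;
```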

I always wondered why English seems such a slow way of communicating ;-)