Refactoring - a PHP demo becomes a production page
Archive - Originally posted on "The Horse's Mouth" - 2008-09-12 15:05:19 - Graham Ellis
"Refactoring" is a term that I've come across in Extreme Programming, but it's also a relevant topic to consider through the life cycle of any software. Perhaps I had better give a definition ....
Refactoring - the updating / alteration of software or systems, usually done in order to take into account a changing requirement.
A couple of years ago, I wrote a little demonstration during a course that took our daily web site log file, analysed it, and reported on the most popular pages on our web site. In those days, the daily log files were around 2 Mbytes each, but that has now risen dramatically - it's been over 30 Mbytes per day for the last 3 days - and that means that the techniques I used in my initial demonstration - quick and easy to write, but relatively slow to run - are no longer totally appropriate. And at the same time as the data has grown, I've extended the program's output from a demonstration of reporting the most popular pages into a much more thorough analysis of web server accesses - looking at accesses to our web server by country, and also at what proportion of our traffic is from robots. All of which has meant refactoring the code as it has progressed - an ongoing process .... Today, it's a page that provides us with a whole lot of information about our most visited pages.
What are some of the aspects involved?
a) Moving from a "recalculate everything each time" type operation to one where elements of caching are involved. This is at two levels.
Firstly, within each analysis - once we have identified a visiting IP address as being from a particular country, and worked out whether or not it's a spider, we retain that information for the rest of the file analysis, on the basis that the IP address cannot change country against our fixed lookup scheme, and that it's very improbable that the same IP would be used by both a regular visitor and a spider.
Secondly (and not yet implemented as I write), there is little point in repeating the analysis many times each day for a log file that turns over in the middle of each night. Better to store the results of analysing the huge file and read those results back to produce the report, than to re-analyse every time. Do note, though, that we don't produce a static page we can simply save, since our script allows a variety of parameters to be passed to it to tailor the report.
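Since that second level of caching isn't in place yet, here is only a minimal sketch of what it might look like - the file names and the analyse_log() helper are assumptions for illustration, not our production code:
$logfile = "/var/log/apache2/access_log";   # assumed path to the daily log file
$cache   = "/tmp/logstats.cache";           # assumed location for the cached results
if (file_exists($cache) && filemtime($cache) >= filemtime($logfile)) {
   # The cache is newer than the log - reuse the stored analysis
   $results = unserialize(file_get_contents($cache));
} else {
   # Re-analyse the (large) log file, then store the results for later runs
   $results = analyse_log($logfile);        # hypothetical analysis function
   file_put_contents($cache, serialize($results));
}
# The report itself is still built on the fly from $results, so the parameters
# passed to the script can continue to tailor the output.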
b) Breaking out data into include files. Early on in the life of our script, we added a few lines of code to test the browser - to see if the user agent string contained something like "MSIE", in which case we could identify the visitor as using Microsoft Internet Explorer. That same logic is also shared by our recent visitors page.
By moving the table of browsers out into a separate file, we can now include an ever-expanding and changing table of browser strings from a single source in both applications - and can easily update it to provide further browser data without having to edit several files that the table would otherwise be buried in the middle of.
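As an illustration only (the real file and variable names differ from what's assumed here), sharing the table between the two scripts comes down to something like this:
include "browser_strings.inc";    # assumed name for the shared file - it defines $browsers
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "";
$browser_name = "Unknown";
foreach ($browsers as $needle => $name) {
   if (stripos($agent, $needle) !== false) {
      $browser_name = $name;      # e.g. "Internet Explorer" for an agent containing "msie"
      break;
   }
}
Because both the log analyser and the recent visitors page pull the table in from the same place, adding a new browser or robot means editing just that one file.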
c) Moving from efficiency of coding to efficiency of running. For the analysis of a small data file, a simple set of regular expression matches to work out which user agents belong to robots and which to real users sufficed. But that gets very slow - especially where there's likely to be a very large number of different strings. The code has been modified to use a much faster strpos to identify certain common browsers without the need for a regular expression at all ... meaning that the work can still be done within the time a user would expect a web page refresh to take.
Here's an example - showing both caching and efficiency changes - from within our script:
if (isset($spip[$line_els[0]])) {
   # Cache hit - we have already classified this IP address earlier in the file
   $isspider = $spip[$line_els[0]];
} else {
   $isspider = 1;   # assume a regular visitor until shown otherwise
   while (1) {
      # Cheap strpos checks eliminate the most common browsers quickly
      if (strpos($line,'MSIE') !== false) break;
      if (strpos($line,'Firefox') !== false) break;
      if (strpos($line,'Safari') !== false) break;
      # Only now fall back to the (slower) regular expression match
      if (eregi($spider_reg,$line)) $isspider = 2;
      break;
   }
   # Remember the verdict for this IP address for the rest of the analysis
   $spip[$line_els[0]] = $isspider;
}
You'll note that we use the array $spip as a cache of data about which IP addresses are used by spiders - taking data from that cache if it's available in preference to doing a more complex analysis. When we do the analysis, we use strpos calls to rapidly eliminate the most common browsers before we go on and match to a (quite complex) regular expression that we have made up from the contents of a browser include file. Here is the include file ...
<?php # Browser Identity Strings - Spot the Spider!
$browsers = array (
"firefox" => "Firefox",
"iceweasel" => "Iceweasel",
"safari" => "Safari",
"netscape" => "Netscape",
"konqueror" => "Konqueror",
"opera" => "Opera",
"NutchCVS" => "Nutch Spider",
"wget" => "Wget",
"msnbot" => "MSN Spider",
"googlebot" => "Google Spider",
"us/ysearch/slurp" => "Yahoo Spider",
"WISEnutbot" => "Looksmart Spider",
"Ask Jeeves/Teoma" => "Ask Jeeves Spider",
"Naverbot" => "NaverBot Spider",
"www.almaden.ibm.com" => "IBM Almaden Spider",
"findlinks" => "Findlinks Spider",
"SocietyRobot" => "E Society Spider",
"ia_archiver" => "ia_archiver Spider",
"Accoona-AI-Agen" => "Accoona Spider",
"psbot" => "psbot Spider",
"seekbot" => "seekbot Spider",
"aipbot" => "aipbot Spider",
"rssimagesbot" => "rssimagesbot Spider",
"happyfunbot" => "happyfunbot Spider",
"msie" => "Internet Explorer",
"Twiceler" => "Twiceler Scraper / Spider",
"Xerka WebBot" => "Xerka WebBot / Spider",
"Yanga WorldSearch Bot" => "Yanga WorldSearch Spider",
"ShopWiki" => "Shop Wiki Spider",
"MJ12bot" => "Majestic 12 Spider",
"Gigabot" => "Gigabot Spider");
?>
... please feel free to use these user agent strings, which I have found amongst the visitors to our site!
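The "quite complex" regular expression used in the earlier snippet is built up from this table. The production code isn't shown here, but a minimal sketch - assuming the file is saved as browsers.inc, and that entries whose display names mention "Spider" or "Scraper" are the ones that should feed the robot pattern - might run along these lines:
include "browsers.inc";           # assumed file name for the table above
$spider_keys = array();
foreach ($browsers as $needle => $name) {
   # Assumption: "Spider" / "Scraper" in the display name marks a robot entry
   if (stripos($name, 'spider') !== false || stripos($name, 'scraper') !== false) {
      $spider_keys[] = $needle;
   }
}
# Join the needles into one alternation, e.g. "msnbot|googlebot|us/ysearch/slurp|..."
$spider_reg = implode('|', $spider_keys);
In practice the odd metacharacter in the keys (the full stops in www.almaden.ibm.com, for example) would want escaping, or will simply match a little more loosely than intended.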