Using Perl to generate multiple reports from a HUGE file, efficiently
Archive - Originally posted on "The Horse's Mouth" - 2011-12-09 06:58:55 - Graham Ellis
If you want to extract two distinct reports from a large data source, there are a number of ways you could do it. The first two are not brilliant:
1. You could read the entire file into memory, and then traverse it several times in a loop. This is a poor solution if the data becomes huge - the footprint of the program becomes massive, it may start swapping on and off the disc, and indeed it may crash "out of memory".
2. You could read the file multiple times. This is hard going on the disc, and potentially very slow as disc access times can be significant.
The third solution, which I describe fully below, is MUCH better ... you can read your data in record by record, just once, and store the data you need for each report in separate variables as you go along. There's a new example from this week's Perl course [here].
Processing a web access log file (30 Mb but could be far bigger!) line by line:
while ($line = <FH>) {   # FH is an assumed name for the filehandle opened on the log file
@parts = split(/\s+/,$line);
I built up strings with the extracted data that I needed for huge URL reads
if ($parts[9] > 1000000) {
$huge .= "$parts[3] $parts[8] $parts[9] $parts[6]\n";
}
and for requests that generated server errors
if ($parts[8] >= 500) {
$server .= "$parts[3] $parts[8] $parts[6]\n";
}
all within the same read loop - here's the end of the while loop:
}
Then - after the file reading was completed - I printed out the results:
print "$huge\n";
print "$server\n";
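For reference, the fragments above can be assembled into one self-contained script. This is only a minimal sketch: the sample log lines are invented, the in-memory filehandle `$fh` stands in for the real access log, and the guards against short lines and non-numeric fields are additions of mine rather than part of the course example. It assumes the usual Apache-style layout, where splitting on whitespace leaves the timestamp in field 3, the URL in field 6, the status code in field 8 and the byte count in field 9.

```perl
use strict;
use warnings;

# Invented sample data in Apache "combined" log layout; a real run would
# open the access log file instead of this in-memory string.
my $sample = <<'LOG';
10.0.0.1 - - [09/Dec/2011:06:58:55 +0000] "GET /big.zip HTTP/1.1" 200 2500000 "-" "Mozilla"
10.0.0.2 - - [09/Dec/2011:06:59:01 +0000] "GET /cgi/broken HTTP/1.1" 500 312 "-" "Mozilla"
10.0.0.3 - - [09/Dec/2011:06:59:07 +0000] "GET /index.html HTTP/1.1" 200 4096 "-" "Mozilla"
LOG

open(my $fh, '<', \$sample) or die "Cannot open sample data: $!";

# One accumulator per report, filled during a single pass over the data
my ($huge, $server) = ('', '');

while (my $line = <$fh>) {
    my @parts = split(/\s+/, $line);
    next unless @parts > 9;    # skip lines too short to hold all the fields

    # Report 1: responses of more than 1,000,000 bytes
    if ($parts[9] =~ /^\d+$/ && $parts[9] > 1000000) {
        $huge .= "$parts[3] $parts[8] $parts[9] $parts[6]\n";
    }

    # Report 2: requests that produced a server error (status 500 and up)
    if ($parts[8] =~ /^\d+$/ && $parts[8] >= 500) {
        $server .= "$parts[3] $parts[8] $parts[6]\n";
    }
}
close($fh);

print "Huge reads:\n$huge\n";
print "Server errors:\n$server\n";
```

Only the two report strings grow as the file is read, so the memory used depends on how many lines match and not on the size of the log.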
That same example has been expanded ... into a third report. I can (of course) add as many reports as I like to this, but in this third case I've used a list instead to collect the data I need within the same while loop that reads the whole file:
if ($line =~ /Trowbridge/) {
push @toon,"$parts[6] $parts[9]\n";
}
This has then allowed me to reorder (sort) the report data before sending it to the output:
@toon = sort(@toon);
print "Trowbridge, by page name:\n", @toon, "\n";
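The list-based report can be tried on its own in the same way. Again this is a sketch under assumptions: the sample lines and the in-memory filehandle `$fh` are invented for illustration, with "Trowbridge" placed in the referrer field. Passing `@toon` to `print` as separate arguments, rather than interpolating it inside a double-quoted string, avoids the single space Perl puts between array elements during interpolation.

```perl
use strict;
use warnings;

# Invented sample lines; here the referrer field carries the search term.
my $sample = <<'LOG';
10.0.0.1 - - [09/Dec/2011:07:00:00 +0000] "GET /resources/perl.html HTTP/1.1" 200 9000 "http://example.com/?q=Trowbridge" "Mozilla"
10.0.0.2 - - [09/Dec/2011:07:00:05 +0000] "GET /about.html HTTP/1.1" 200 1200 "http://example.com/?q=Trowbridge+hotel" "Mozilla"
10.0.0.3 - - [09/Dec/2011:07:00:09 +0000] "GET /index.html HTTP/1.1" 200 4096 "-" "Mozilla"
LOG

open(my $fh, '<', \$sample) or die "Cannot open sample data: $!";

my @toon;    # one element per matching line, so the report can be sorted later

while (my $line = <$fh>) {
    my @parts = split(/\s+/, $line);
    next unless @parts > 9;    # skip lines too short to hold all the fields

    if ($line =~ /Trowbridge/) {
        push @toon, "$parts[6] $parts[9]\n";
    }
}
close($fh);

# Default sort compares strings; each element starts with the URL,
# so this orders the report by page name.
@toon = sort @toon;
print "Trowbridge, by page name:\n", @toon, "\n";
```

A list costs a little more memory than a single string, but it keeps each record separate, which is what makes the reordering possible.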
Example written on this week's Learning to program in Perl course.