Spraying data from one incoming file to a series of outgoing files in Perl

Archive - Originally posted on "The Horse's Mouth" - 2012-08-15 14:25:35 - Graham Ellis

Scenario. I have a lot of data that contains large numbers of records which I want to separate into groups. For example, an incoming web server log file which I want to split out and process visitor by visitor.

Using Perl, I can loop through my data line by line and store it into a hash - for example:
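    # FH is assumed to be a filehandle already opened on the log file
    while ($lyne = <FH>) {
        $lyne =~ /\S+/ ;        # first non-space string on the line
        $all{$&} .= $lyne;      # append the record to that visitor's string
    }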

• In a web server log file, the IP address / name of the visiting server is the first non-space string on the line
• It's perfectly valid to do a regular expression match outside a condition - if you're working with an automatically generated data file that does not need any validation, this is acceptable practice too
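• $& is the special variable that contains "the bit that matched" after a regular expression match in Perl. If the incoming string is massive and there are lots of matches in a tight loop, using $& *may* be a bit inefficient: on Perl versions before 5.20, mentioning $& anywhere in a program adds a cost to every regular expression match in it. Putting brackets around the pattern and reading $1 is the usual alternative.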
• The .= operator adds on to the end of an existing string. If I had wanted a list of accesses (rather than a string containing them all), I could have pushed each record onto a list within the hash (but that would be at a later point in the course).
• An implicit reference to a variable such as the hash %all in my example will cause it to be created if it doesn't exist (the very first time through the loop), and each new element in that hash will similarly be created as necessary. In a longer program, declaring a lexical hash with my %all may be more appropriate.
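
The points above can be pulled together in a slightly more defensive variant. This is only a minimal sketch, not the code from the complete program: it does the match inside a condition, uses a capture group ($1) rather than $&, declares the hash with my, and pushes each record onto a list held in the hash. The subroutine name group_by_first_field and the sample records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original program): group lines by
# their first non-space field, keeping each group as a list of records.
sub group_by_first_field {
    my @lines = @_;
    my %all;                          # lexical hash, starts empty
    foreach my $lyne (@lines) {
        # Match inside a condition; lines with no non-space text are skipped
        if ($lyne =~ /(\S+)/) {
            # $1 holds the captured first field; the list for each key
            # is created automatically the first time that key is seen
            push @{ $all{$1} }, $lyne;
        }
    }
    return %all;
}

# Made-up sample records standing in for web server log lines
my %all = group_by_first_field(
    "10.0.0.1 GET /index.html\n",
    "10.0.0.2 GET /about.html\n",
    "10.0.0.1 GET /contact.html\n",
);

foreach my $visitor (sort keys %all) {
    print "$visitor made ", scalar @{ $all{$visitor} }, " request(s)\n";
}
```

Holding a list rather than one long string makes it easy to count, sort or filter the records for each visitor later on, at the price of a little extra syntax.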

Using the code above, I then output each of the members of the hash, so grouping records by visiting client, and within visiting client by date and time since that's the order in which they are stored in the original file:
    # Loop over the visitors collected in %all (order of keys is not defined)
    foreach $visitor (keys %all) {
        print $all{$visitor},"\n";
    }

• Note the extra \n. In Perl, you always need to think about your new lines. In this example, they're present on the records when read in, they are not removed with chop or chomp, so they are kept within the $all string as record delimiters. I've added the extra one in the output code just to provide a degree of separation between the blocks.
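
Since the title talks about outgoing files, here is a hedged sketch of how each group might be written to a file of its own rather than printed to standard output. It is an assumption about the approach, not a copy of the complete program; the subroutine spray_to_files, the output directory argument and the .txt suffix are all hypothetical. Because the records still carry their own new lines, no extra \n is needed when writing them.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write each visitor's records to "$dir/<visitor>.txt".
# $all is a reference to a hash mapping a visitor name to a single string
# of newline-terminated records (as built with .= in the loop above).
sub spray_to_files {
    my ($dir, $all) = @_;
    my @written;
    foreach my $visitor (sort keys %$all) {
        # Assumption: visitor names are host names or IP addresses, so
        # anything other than word characters, dots and dashes is replaced
        (my $safe = $visitor) =~ s/[^\w.-]/_/g;
        my $path = "$dir/$safe.txt";
        open(my $out, '>', $path) or die "Cannot write $path: $!";
        print $out $all->{$visitor};   # records keep their own new lines
        close($out) or die "Cannot close $path: $!";
        push @written, $path;
    }
    return @written;
}

# Usage with made-up data, written into a temporary directory
my $dir = tempdir(CLEANUP => 1);
my @files = spray_to_files($dir, {
    '10.0.0.1' => "10.0.0.1 GET /a\n10.0.0.1 GET /b\n",
    '10.0.0.2' => "10.0.0.2 GET /c\n",
});
print "$_\n" for @files;
```

Opening one file per visitor after all the input has been read keeps only one output handle open at a time, which matters if there are thousands of distinct visitors.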

The complete program that the snippets above are copied from is on our web site - [here].