Reporting on the 10 largest files or 10 top scores
Archive - Originally posted on "The Horse's Mouth" - 2006-08-20 01:58:48 - Graham Ellis
What are the biggest 10 files in or below this directory?
What are the 20 'worst' spams I have received in the last month?
What are the five top scores recorded for a popular game on my web site?
It's a very common requirement indeed to provide a program to answer questions like these, and if you've only got a handful of files /spams / score records, it's easy to write a program to read them all into an array (PHP) or list (Perl, Python), sort that array or list when you've read them all, and print out the first however-many. But that approach becomes impractically slow and memory greedy if you have a big log file ... as for example the quarter of a million records I have in my spam record file at the moment.
Here's the technique you can use to find the top 20 records from several million in a log file - quickly, and efficiently ....
1) Set up an empty list to contain the top 20 (so far) as you discover them.
2) pass through the record one by one ...
2a) Work out the comparsion factor (size, score) for the record just read
2b) If you have already read and stored 20 records, and the new record is below the 20th one stored, reject it OTHERWISE ...
2c) Step through the records retained so far and insert the new one at the appropriate place in the list
2d) If the list now contains more that 20 records, truncate it to 20
3) Print out your results from the list you now have.
You can see this algorithm implemented in PHP here and you can run it here. It's not the simplest of code, but it should aways work no matter how large or how small the cutoff between the 20th and 21st record is (as opposed to alternative algorithms that set a threshhold), and it should always work quite fast even on a data set that's pretty huge; most of the data will be rejected summarily and won't need to be stored at all.
You might suggest that my data should be stored in a MySQL database and not a plain text file ... that's not the problem I was given, and is worthy of an entry here another day!