Which (virtual) host was visited? Tuning Apache log files, and Python analysis

Archive - Originally posted on "The Horse's Mouth" - 2015-01-23 06:56:40 - Graham Ellis

We host a number of domains on our main server and, to avoid fragmenting the log files, we keep a single composite log. Rather than stick with the standard log file format, I've changed the second field to carry the name of the virtual host that handled each request; that information was missing from our logs up until this morning.

So in my server config
  LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
has become
  LogFormat "%h %v %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

Gone - %l - Remote logname (from identd, if supplied).
Added - %v - The canonical ServerName of the server serving the request.
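
With the field swapped, the virtual host is simply the second whitespace-separated item on each line. A minimal sketch of pulling it out follows; the function name vhost_of and the sample line are made up for illustration and are not taken from the real program or log.

```python
def vhost_of(line):
    """Return the virtual host (second whitespace-separated field), or None."""
    fields = line.split(None, 2)   # only the first two fields need splitting off
    if len(fields) < 2:            # blank or truncated line
        return None
    return fields[1]

# Invented log line in the modified format (192.0.2.x is a documentation address)
sample = ('192.0.2.1 www.wellho.net - [23/Jan/2015:06:56:40 +0000] '
          '"GET / HTTP/1.1" 200 1234 "-" "ExampleAgent/1.0"')
print(vhost_of(sample))
```

Splitting on whitespace is enough here because the host address and the virtual host name never contain spaces; the quoted fields later in the line would need more careful parsing.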

I've written a program (in Python) to take a look at the log file - see [here]. Run from the command line, it gives:
  -bash-4.1$ /home/wellho/trainee/y202/pytop
   13358 - www.wellho.net
    2719 - www.firstgreatwestern.info
     223 - www.melkshamchamber.org.uk
     100 - melksh.am
      62 - www.twcrp.org.uk
      39 - www.savethetrain.org.uk
      39 - www.across-the-pond.co.uk
      33 - www.wellhousemanor.co.uk
      30 - twhc.org.uk
      16 - transwilts.org.uk
       5 - thebutlerdidit.info
       1 - railcustomer.info
  -bash-4.1$
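
A report like the one above could be produced along the following lines. This is a sketch rather than the linked program: the use of collections.Counter, the function name count_hosts and the three demo lines are all assumptions for illustration.

```python
from collections import Counter

def count_hosts(lines):
    """Tally requests per virtual host (second whitespace-separated field)."""
    counter = Counter()
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) >= 2:       # skip blank or truncated lines
            counter[fields[1]] += 1
    return counter

# Made-up lines in the modified format, for illustration only
demo = [
    '192.0.2.1 www.wellho.net - [23/Jan/2015:06:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
    '192.0.2.2 melksh.am - [23/Jan/2015:06:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
    '192.0.2.3 www.wellho.net - [23/Jan/2015:06:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
]

counter = count_hosts(demo)
for site, hits in counter.most_common():   # busiest host first
    print('{0:6d} - {1:s}'.format(hits, site))
```

For a real log, an open file object can be passed in place of the demo list, since iterating over a file yields one line at a time.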


The program's also got a web wrapper - if called up on the web, it uses a different formatter:
  import sys
  output = '{0:6d} - {1:s}'
  # "-w" as the first argument selects HTML table rows for the web wrapper
  web = len(sys.argv) > 1 and sys.argv[1] == "-w"
  if web:
    output = '<tr><td>{0:d}</td><td><a href="http://{1:s}" target="avh">{1:s}</a></td></tr>'

and later in my code:
  print(output.format(counter[site], site))
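
To see what the two formatters produce, here they are applied to the first row of the command line report. The helper name render is hypothetical; the format strings are the ones shown above.

```python
plain = '{0:6d} - {1:s}'
html = '<tr><td>{0:d}</td><td><a href="http://{1:s}" target="avh">{1:s}</a></td></tr>'

def render(fmt, hits, site):
    """Fill one report row using whichever format string was selected."""
    return fmt.format(hits, site)

print(render(plain, 13358, 'www.wellho.net'))
print(render(html, 13358, 'www.wellho.net'))
```

Because both strings take the same two positional arguments, the print call never needs to know which mode is in use; the HTML version just uses the site name twice, once for the link and once for the label.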

And you can see the current results [here].

P.S. There's another quick demo web analysis program (showing its age) [here].