
Learning more about our web site - and learning how to learn about yours

Archive - Originally posted on "The Horse's Mouth" - 2011-12-17 12:15:00 - Graham Ellis

There are quite a number of tools out there which will give you statistics about your web site - and quite a lot of people who will tell you various statistics about yours and theirs. But there are "lies, damned lies and statistics", according to Benjamin Disraeli. How do you really understand your traffic and site? I think you should look at it from lots of different directions, understand how the figures are reached, make incremental changes to your methodology to explore the feel of the site in more detail, and cross-compare multiple sites and multiple time periods.

We keep (Apache httpd) log files on our servers, look at them with certain tools on a regular basis, and try out other tools - and some new ones - from time to time.

Here are some statistics from a demonstration program I wrote yesterday, and from an example written on a previous course but re-run with the same set of log files - which are from our main server for November 2011, and total over 800 Mbytes of input data.

Statistics and diagrams

  completed ac_20111104 8825 visitors
  completed ac_20111105 5900 visitors
  completed ac_20111106 6436 visitors
  completed ac_20111107 9562 visitors
  completed ac_20111108 10192 visitors
  completed ac_20111109 10114 visitors
  completed ac_20111110 9871 visitors
  completed ac_20111111 8862 visitors
  completed ac_20111112 6181 visitors
  completed ac_20111113 7002 visitors


  code 200 - count  3037207 -    95.36%
  code 206 - count     3459 -     0.11%
  code 226 - count        4 -     0.00%
  code 301 - count     3118 -     0.10%
  code 302 - count     2538 -     0.08%
  code 304 - count    30282 -     0.95%
  code 400 - count      347 -     0.01%
  code 403 - count    10750 -     0.34%
  code 404 - count    96625 -     3.03%
  code 405 - count       26 -     0.00%
  code 408 - count       65 -     0.00%
  code 416 - count        1 -     0.00%
  code 500 - count      259 -     0.01%
  code n/a - count      188 -     0.01%


  Sum of distinct hosts each day -      269544
  Number of distinct visiting hosts -   182849
  Total URLs requested -               3184869
  Total web pages requested -          2138893


The above statistics are from yesterday's program - source code [here].

[Diagrams: contour plots / heat maps of hourly URL requests through each day of November 2011 - see (c) under Methodology]

The above diagrams are from a Python program, using numpy and matplotlib, that was written on a prior private advanced Python course and rerun on the same data that was used for the statistical tables. Source of that program [here].

Methodology

a) Analysis of log files. Both of my programs read through each of the daily log files line by line, and extracted the required data from each line. Part of the analysis for the statistical program differentiates between "primary URLs" - the sort of thing you would type into a browser - and "secondary URLs" - things like images, icons, style sheets and JavaScript which typically aren't fresh page requests from a visitor, but are called up from within other requests. We have very little Ajax traffic, and very few pages indeed with frames, so there was no need in my sample demonstration program to make allowances for the skew which they would add. A rough sketch of the approach follows.
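
By way of illustration - this isn't the original program, which is linked above - here's a minimal Python sketch of that line-by-line approach. The file name pattern (ac_YYYYMMDD), the combined log format and the list of "secondary" extensions are all assumptions based on the description in this post.

  # Minimal sketch: read each daily log file line by line and split
  # requests into "primary" (page) and "secondary" (support file) URLs.
  import re
  from glob import glob

  SECONDARY = re.compile(r'\.(png|gif|jpe?g|ico|css|js)(\?|$)', re.I)   # images, icons, style sheets, JavaScript
  REQUEST = re.compile(r'"(?:GET|POST|HEAD) (\S+)')                     # request line in a combined-format record

  primary = secondary = 0
  for logname in sorted(glob("ac_201111*")):                            # one file per day, name pattern assumed
      with open(logname, encoding="utf-8", errors="replace") as logfile:
          for line in logfile:
              match = REQUEST.search(line)
              if not match:
                  continue                                              # unparseable line - see (e) below
              if SECONDARY.search(match.group(1)):
                  secondary += 1
              else:
                  primary += 1

  print("primary (web page) requests:", primary)
  print("secondary (support file) requests:", secondary)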

b) Elimination of parameters. Many of our pages can take parameters supplied via the "GET" method, and we used regular expressions to trim those values off the end of the URLs when we came to count accesses to different pages. As a separate exercise, analysis of those parameter strings could be very useful indeed.
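
The exact expression we used isn't shown in this post, but the trimming can be as simple as removing everything from the first "?" onwards - a hedged example (the URL shown is made up):

  import re

  def strip_get_parameters(url):
      """Count /page.html?a=1&b=2 and /page.html as the same page."""
      return re.sub(r'\?.*$', '', url)

  print(strip_get_parameters("/net/course.html?topic=python&day=2"))   # -> /net/course.html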

c) Graphics. The images all show the number of URL hits (primary and secondary) within each hour period, joined to form a contour plot / heat map. A more technically accurate display would be a block diagram - a 3D histogram - as the data isn't really "sloping" in the way shown. Nevertheless, the displays are very effective in highlighting the way traffic increases and decreases during the day. Even on a site with traffic as high as ours, spikes can occur and there's a certain randomness. The third diagram is intended to help demonstrate underlying trends, but care should be taken in reading any significance into the figures. The maximum figure shown (7000) is certainly not the maximum number of requests made in an hour (9000).
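
The plotting program itself is linked above; as a rough idea of the technique, here's a self-contained numpy / matplotlib sketch that draws an hour-by-day contour plot from a counts dictionary. The data filled in at the bottom is purely a placeholder so the sketch runs - real counts would come from the day and hour fields of each log line.

  import numpy as np
  import matplotlib.pyplot as plt

  def plot_hourly_heatmap(counts, days, outfile="hourly_traffic.png"):
      """counts[(day, hour)] -> number of URL requests in that hour."""
      grid = np.zeros((24, len(days)))
      for col, day in enumerate(days):
          for hour in range(24):
              grid[hour, col] = counts.get((day, hour), 0)
      plt.contourf(range(len(days)), range(24), grid)      # contour plot / heat map
      plt.xlabel("day of month")
      plt.ylabel("hour of day")
      plt.colorbar(label="URL requests per hour")
      plt.savefig(outfile)

  # Placeholder data only - NOT real traffic figures
  demo = {(day, hour): 100 + 50 * hour for day in range(1, 31) for hour in range(24)}
  plot_hourly_heatmap(demo, days=list(range(1, 31)))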

d) Not a sum of daily figures. One of the big myths ... is that 1,000 unique visitors a day means 30,000 unique visitors a month. It doesn't; visitors come back to many web sites day after day, and for an average of 1,000 unique visitors per day you would hope that the "unique visitors per month" figure was well below 30,000!
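
A tiny worked example of why the two figures differ - the host sets here are invented purely to show the arithmetic, mirroring the real 269,544 / 182,849 figures above:

  # Summing each day's distinct hosts over-counts, because the same
  # host shows up on several days.
  daily_hosts = [
      {"1.2.3.4", "5.6.7.8", "9.9.9.9"},     # day 1
      {"1.2.3.4", "8.8.8.8"},                # day 2 - 1.2.3.4 has come back
      {"1.2.3.4", "5.6.7.8", "7.7.7.7"},     # day 3
  ]

  sum_of_daily = sum(len(hosts) for hosts in daily_hosts)      # 8 - like the 269,544 figure
  distinct_in_month = len(set().union(*daily_hosts))           # 5 - like the 182,849 figure
  repeat_visits = sum_of_daily - distinct_in_month             # 3 return visits

  print(sum_of_daily, distinct_in_month, repeat_visits)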

e) Broken lines. Our analysis shows a few "n/a" status codes. The log file format that's used by httpd needs a bit more reverse engineering than I've done to get every line 100% right - but with no more than 7 lines in 100,000 having problems on the simplified algorithm, I've chosen to go with that.
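
The simplified status-code extraction might look something like this - a hedged sketch rather than the code linked above; lines that don't match the expected pattern get counted under "n/a" rather than being reverse engineered further:

  import re
  from collections import Counter

  STATUS = re.compile(r'" (\d{3}) ')        # status code follows the quoted request line

  def status_of(line):
      match = STATUS.search(line)
      return match.group(1) if match else "n/a"

  codes = Counter()
  # for line in logfile:                    # logfile opened as in the earlier sketch
  #     codes[status_of(line)] += 1

  print(status_of('127.0.0.1 - - [17/Dec/2011:12:15:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "-"'))   # 200
  print(status_of('some line the simple pattern cannot handle'))                                     # n/a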

Conclusions

1. Weekly Cycle. This is fantastic news for us. Look how the traffic during the week (Monday to Friday) hovers around 10,000 unique daily visitors, but drops to 6,000 to 7,000 at the weekend. Friday's lower figure (POETS day - Piss Off Early; Tomorrow's Saturday) helps confirm work / business customer use. That lower Friday figure, with Sunday higher than Saturday too, possibly reflects Muslim countries with a Friday / Saturday weekend, or possibly UK habits of going out on Saturday and doing hobby things, including computing, on Sunday.

2. Daily cycle. (From the graphics only). A very interesting demonstration of peak traffic during the UK working day, with a surprisingly early start (perhaps because India is about 5 hours ahead of the UK), and a busy evening (we also get considerable traffic from the USA as other analyses have shown).

3. Repeat Visitors. There were 183,000 unique visitors in the month, but 270,000 if you add up the number of unique visitors on each day - so that means around 87,000 return visits. Bear in mind that I visit every day, so that's 29 repeats from me alone; it's NOT 87,000 different returning individuals, but it's still an interesting statistic!

4. Images / Avatars / FGW. Here's an interesting piece of background. Our domain / server also hosts some images (and my avatar) used on the First Great Western Coffee Shop, which is a busy site with an active forum. This will account for some of the difference between the 2 million web pages and the 3 million URLs requested. Further analysis is called for, I think.

5. 403 / 404 / 500 comments. 19 out of 20 accesses to the server returned a good page and response - code 200. Many other return values (206, 301, 302, 304) are perfectly acceptable in moderation. But what about the other codes? Common wisdom has it that you don't want any 400 or 500 series errors, but to some extent I disagree. There's nothing wrong in sending a search engine crawler a "404 page not found" if a page has been withdrawn and not replaced, for example. The particular server that we've analysed for this report goes further, intentionally returning codes 403, 404 and 500 to requests which are testing the security of our site / looking for holes - we're saying "Go away - that's not here", "You can't have that" and "broken" where appropriate to these nasties - in a (perhaps vain) hope that they'll stop knocking on the door.

6. Staying power. Each visiting host made an average of around 17 requests (3,184,869 URLs requested by 182,849 distinct hosts). There's a lot more analysis possible here. Yet, interestingly, on our site we consider that a single page hit is often a success - someone lands from a search engine on a page that answers their question. Job done. Also marketing done - our name's out there, and they may well remember how helpful we are in the future when they need a course.
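
One possible next step - again just a sketch, with made-up hosts standing in for the log data - is to look at the distribution behind that average of 17: how many hosts made exactly one request, how many made two, and so on:

  from collections import Counter

  # Toy stand-in for the client address taken from the first field of each log line
  hosts = ["1.2.3.4", "1.2.3.4", "5.6.7.8", "9.9.9.9", "1.2.3.4"]

  requests_per_host = Counter(hosts)
  depth = Counter(requests_per_host.values())          # N requests -> number of hosts
  for n_requests, n_hosts in sorted(depth.items()):
      print(n_hosts, "host(s) made", n_requests, "request(s)")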

7. Monetarise. An interesting suggestion has been made - that we should cash in on / make money from our very heavy traffic; advertising, click-through, agent sales, charging for use and building up a saleable email address database are all possible. We're very careful about venturing down these paths - we monetarise via course and hotel room sales at present, and I suspect that the majority of users of our pages don't want to be added to lists from which they're barraged with emails. That is OUT. We may build more agency sales at some point, though.

8. Much more! Which pages? Which parts of the world? I have only just started to scratch the surface.