Be careful of misreading server statistics

Archive - Originally posted on "The Horse's Mouth" - 2008-05-28 06:47:30 - Graham Ellis

Here's a mystery for you.

Background

Over the past weekend, I was "fighting" server outages on another computer where - about once an hour - the httpd daemon appeared to be running away, either stuck in some sort of hole or under a denial of service attack. It was a tricky one to find, as the temporary fix I had in place was a "heartbeat" script that killed all existing connections and freshened up the server. And when the server was busy, it was running so much like "treacle" that I couldn't run any Linux commands from a shell to see what was going on.

Mystery

I was aware from my heartbeat log of a total of around 20 seconds per hour during which the server was not accepting requests - that's about 0.5% of the time. Yet I had a user who was telling me that in his experience, downtime was around 10%. Wow - that's some scary figure, isn't it?

Any ideas?

Turns out to be a case of how you gather your statistics!

Solution

My heartbeat script kicks in at the start of every minute, and if there's a problem it tidies up, which takes 5 seconds. Having kicked in once, it then does a further precautionary clean at the following "top" of the minute, and perhaps, if it's not sure that load levels are dropping as they should, at the minute after that too. So in a bad hour, 4 outages of 5 seconds = 20 seconds.

It turns out that my user was running an automated script to check our server, again at the top of the minute. So he had synchronised his tests to our server in such a way that he always saw it during that brief clean-up. Looking at his log activity later, I noticed that if he got a failure he had programmed in a second hit straight away to confirm it. So he was seeing either 4 failures in 60 tests or, counting the 4 retries that also failed, 8 failures in 64 tests - that's about 6.7% or 12.5% to report.
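
For anyone who wants to check the arithmetic, here is a rough Python sketch of the effect. It is not the heartbeat script or the user's checker. The function name simulate_hour, the choice of which minutes are "bad", and the one-second delay before the confirmation hit are all assumptions made for illustration. It compares the true fraction of the hour spent down with the fraction seen by a probe that fires once a minute.

```python
def simulate_hour(outage_minutes, outage_seconds=5, probe_offset=0,
                  retry_on_failure=False, retry_delay=1):
    """Return (true downtime fraction, downtime fraction seen by a probe).

    outage_minutes: minutes of the hour in which a clean-up starts at
    second 0 (hypothetical values, chosen only for the illustration).
    probe_offset: seconds past the top of each minute at which the
    monitoring probe fires.
    """
    # Every second of the hour (0..3599) in which requests are refused.
    down = set()
    for minute in outage_minutes:
        start = minute * 60
        down.update(range(start, start + outage_seconds))

    true_fraction = len(down) / 3600

    probes = 0
    failures = 0
    for minute in range(60):
        t = minute * 60 + probe_offset
        probes += 1
        if t in down:
            failures += 1
            if retry_on_failure:
                # Assumed: the confirmation hit lands retry_delay seconds
                # later, which is still inside a 5 second clean-up.
                probes += 1
                if t + retry_delay in down:
                    failures += 1

    return true_fraction, failures / probes


if __name__ == "__main__":
    bad_minutes = [10, 11, 30, 31]  # assumed: 4 clean-ups in a bad hour

    true_frac, seen = simulate_hour(bad_minutes)
    print(f"true downtime:        {true_frac:.2%}")   # 0.56%
    print(f"top-of-minute probe:  {seen:.2%}")        # 6.67%

    _, seen_retry = simulate_hour(bad_minutes, retry_on_failure=True)
    print(f"probe with retry:     {seen_retry:.2%}")  # 12.50%

    _, seen_offset = simulate_hour(bad_minutes, probe_offset=30)
    print(f"probe at :30 seconds: {seen_offset:.2%}") # 0.00%
```

The last line is the other side of the same coin: a probe that happened to fire half way through each minute would have reported no downtime at all, which is just as misleading as 12.5%. Neither figure is close to the real 0.56%, because the probe times are locked to the same clock as the clean-ups rather than spread across the hour.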

Lies, damned lies and statistics

This is an "object lesson" in being careful with statistics - at best, they're helpful, and at worst they can give a totally incorrect picture. But I have to say that this example really took the biscuit!

Footnote - server issue solved. Availability now over 99.8% and the remaining outages in the last couple of days relate to me testing.