Web Site Loading - experiences and some solutions shared
Archive - Originally posted on "The Horse's Mouth" - 2009-02-26 09:27:15 - Graham Ellis
I can recall a colleague of mine (OK, if you're reading this Peter, yes you were the boss) using the term "Open Kimono" to describe certain approaches at certain times, and (truth be told) I wasn't sure if there was something a little naughty in the connotations that the term conjured up. Yet the term came back to me this morning when I was wondering whether to post up some recent experiences / comments from the growth curve we have been seeing in resource usage on our web server. But I think I'm OK to use the term ... the My Open Kimono Blog uses it, for example.
Why move the web site?
Here goes. We moved our main domain to a dedicated server six months ago. Traffic levels were such that our daily Apache httpd logs on a shared server in the USA were around 15 Mbytes each, and we had concerns about the lag time for traffic from the UK (our primary market) to make the round trip. We were also concerned that search engines were seeing us, with a ".net" top level domain, as being located and trading in the country in which our server was located rather than in the UK. And we had some security concerns with regard to the peaky load that others were putting on the server, and the possibility of PHP injection attacks into our scripts by others on the machine (or, rather, due to loopholes left by others sharing the system - see here).
The first problem, and a warning sign

But there was a serious latent issue. How could the running of a single script - even if it burned up 20 seconds of CPU time - cause an ongoing problem, as it appeared to have done when it ran that night?
The Current Issue
Traffic has now risen; from a 15 Mbyte daily log file in July, it has grown in less than six months to peak at nearly 50 Mbytes per day ... and we have seen other occasions when the server's queue length (the load average reported by uptime) - usually between 0.2 and 0.8 - has swept majestically upwards to 40, 50 or more and has stuck there. A temporary cure has proven easy enough - stop the httpd and mysql daemons, restart them, and the whole thing purrs along sweetly until the next time.
So have httpd and / or MySQL been stuck in some sort of loop?
No - I don't think so. I think we have simply filled up the server's real memory and it has fallen back on its 'swap space', with more httpd and MySQL processes / threads than can fit in real memory fighting for that space, and with the disk 'thrashing' as pages are shuffled in and out. Meanwhile, new requests join the queue quicker than completed ones are being peeled off. In other words, it's a self-perpetuating problem which, once it has started to occur, is likely to get progressively worse. Unlike a bus queue, where you can see you've got a wait ... pop off and get a coffee and come back a bit later ... you have no such option on a web server ...

... and in effect it's made worse by the driver of each and every bus having to stop and re-organise the queue on every trip, thus cutting down the capacity for the queue to be handled at the very time it's most needed!
Evidence
What evidence do I have that it's pure load rather than one particular script? Well - the problem was triggering just after 6 a.m. on some mornings - and that's the time an extra load (a server backup) gets added to the job queue - actually several jobs, including a database dump and some tars. Each runs perfectly well when started manually at a quiet time, but if the server happens to be busy they tip it into an unrecoverable overdrive.
And then the problem triggered, it seemed, at around lunchtime and again between 4:30 and 6:00 in the afternoon - the busiest times on our server, with the UK and European traffic heavy just after noon, and the UK traffic still very busy at the time the USA traffic was picking up towards the end of the afternoon / early evening. And on Saturdays and Sundays, when our servers are notably quieter, everything ran sweetly (which gave me false hope as I tried to fix the issues at the weekend!).
There's a technical article here in which I show a top report comparing our server when well behaved and when thrashing.
Possible Solutions
More buses, more efficient buses, and taking measures to turn the very occasional person away when the queue is starting to get to the "needs marshalling" stage. We can also make sure that everyone in the queue really wants to travel!
How do those work in web server terms?
• More buses.
For the moment, let's put that one on the back burner. We could cross the palms of our WSP with more silver each month, but there's little point in purchasing something that's not needed.
• More efficient buses.
If we can get the buses to run trips more quickly, the same number of buses will handle more customers and will stop the queue bursting. There are quite significant elements of PHP in most of our pages, and quite a bit of MySQL too - indeed, most of our images are fed from a database.
I have reduced bookkeeping operations in our scripts so that they're run not on every page, but only randomly on around one page in five. A few excess records in "what happened in the last quarter hour" really don't matter.
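By way of illustration, here is a minimal sketch of the "one page in five" idea - the helper function name is hypothetical and stands in for our actual bookkeeping code:
# Only do the extra bookkeeping work on roughly one request in five
if (mt_rand(1, 5) == 1) {
    record_recent_activity();   # hypothetical helper that logs this hit
    }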
Various other smaller actions.
And the big one - I have added an index based on the URL to our 15,000 page stats database that we use to provide the relative importance map for Google, and the Google-like search results on our resources pages. It's probably significant that some of the problems only started to occur at around the time that this extra data started to be collected! [detail]
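Adding such an index is a one-off job along these lines - a sketch only, with hypothetical table, column and connection details rather than our real ones:
# Add an index on the URL column so that per-page lookups in the
# 15,000 row stats table no longer have to scan every record
mysql_connect("localhost", "statsuser", "secret");    # hypothetical connection details
mysql_select_db("sitestats");                         # hypothetical database name
mysql_query("ALTER TABLE page_stats ADD INDEX url_idx (url(100))");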
• Fast track service desks
I have taken about a dozen images which are served very frequently indeed and moved them to plain files, rather than serving them via PHP and MySQL.
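The move itself can be as simple as a one-off export script along these lines (the table, column and path names are made up for the example, not our real ones):
# One-off export: pull a frequently requested image out of the database
# and write it to a plain file that Apache can serve without PHP or MySQL
mysql_connect("localhost", "statsuser", "secret");    # hypothetical connection details
mysql_select_db("sitestats");
$row = mysql_fetch_assoc(mysql_query(
    "SELECT imagedata FROM images WHERE name = 'logo'"));   # hypothetical table / column
file_put_contents("/var/www/html/fixed/logo.jpg", $row["imagedata"]);
The pages then reference the plain file directly, and the fast track desk never has to open a database connection at all.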
• Limiting the queue
I don't want to turn people away - in fact I HATE doing it - but a very few dropped connections from time to time are far, far better than having the whole queue come to a screaming halt until the server's heartbeat is missed on our monitoring machine, which then screams for the administrator.
I have tuned our queues ... and there is a technical article that I've added to the site here that tells you about how I've done that.
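To give a flavour of the sort of settings involved - the numbers here are purely illustrative, not the values from the article linked above - the prefork section of httpd.conf lets you cap how many Apache children run at once, so that the server never outgrows its real memory, and how many waiting connections the kernel will hold before turning callers away:
<IfModule prefork.c>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    # Illustrative cap: enough children to fill real memory, no more
    MaxClients           60
    # Recycle children periodically to limit memory creep
    MaxRequestsPerChild 2000
</IfModule>
# Waiting connections the kernel will queue before refusing newcomers
ListenBacklog 100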
• Restricting to really wanted travellers
You may recall articles about libwww and Babycaleb earlier on this blog. This sort of traffic, generated by automata, is very peaky and (in the case of the examples quoted) totally unwanted ... the articles linked just above tell you how I have turned away a great deal of that traffic at the front door, and how I have ensured that much of the rest of it is "fast tracked" as above.
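One way of doing that sort of front door check - a sketch only, and with example User-Agent patterns rather than our full list - is a few lines of PHP at the top of each scripted page:
# Turn away known unwanted automata before any database work is done
$agent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
if (preg_match('/libwww-perl|Babycaleb/i', $agent)) {   # example patterns only
    header("HTTP/1.0 403 Forbidden");
    exit;
    }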
• Cutting out needless journeys

Here's part of our PHP script which manages our image database - the part where it tells the browser to keep the information it's been given for up to an hour, and not to keep asking for it. This is very important for images like your logo which will appear on every page!
# Send out the image, telling the browser (and any proxies)
# that it may be cached and re-used for up to an hour (3600 seconds)
header("Content-Type: image/jpeg");
header("Cache-Control: max-age=3600");
print $imagebytes;
You will also have seen me talking about adding restrictions into our robots.txt file to avoid needless crawling of pages that really shouldn't be indexed, or where our scripts generate URL loops that can trip the spiders. See here and here for some past experiences, and there's a sample copy of our file here. I have added a few 'loop killers' since I wrote that example.
Have you ever seen a nice picture on someone else's web site and added a link to it on yours? It's called hot linking, and if you link to an image on an obscure site from a very popular one, you can have a sudden and detrimental effect on that site. There are occasions where our web site suddenly gets hundreds or thousands of hits out of the blue - and really it's theft of bandwidth and probably of images. We are monitoring / watching such images - you can use my monitor tool here and see what's a popular steal at the moment. And you can read about past comments I have made and technical ways to discourage the habit here.
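One way of spotting the habit from inside the PHP image script is to look at the Referer header - a sketch only, since not every browser sends one (so empty referers are let through), and the domain and logging below are placeholders rather than our production code:
# Note hot linked requests - the referer is from somebody else's page
$referer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
if ($referer != "" and strpos($referer, "oursite.net") === false) {   # substitute your own domain
    error_log("Hot linked image requested from $referer");
    # ... and at this point you could serve a small placeholder instead
    }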
Finally, you can cut out some excess traffic by telling people when pages are broken. You may recall past articles (possibly no longer around even here) showing how you can divert erroneous URL requests to your site search and return a good page. It's a fabulously useful technique for real visitors, but it's almost designed to set the search engines off in a feeding frenzy if they get a bad URL - especially if you suggest other searches. Take care with scripts like this ... ensure that your automata users are sent "404" responses, while being much more helpful to the customer who has just guessed at a URL by serving him with useful guidance and content.
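In outline, and purely as a sketch (the bot test here is crude, and show_search_suggestions is a hypothetical helper standing in for your own code), the error page can behave like this:
# A page that doesn't exist has been requested
$agent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
header("HTTP/1.0 404 Not Found");       # everyone gets an honest 404 status
if (preg_match('/bot|spider|crawl/i', $agent)) {
    exit;                               # automata get nothing more - no feeding frenzy
    }
show_search_suggestions($_SERVER["REQUEST_URI"]);   # hypothetical helper - guidance for the human visitor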
Where now
I don't think I've reached the end of the story yet. Traffic will go on increasing and - at best - we've currently got something of a lid on it; occasional queues which will potentially get longer. Yes, I know there's a recession or depression on - but it's not depressed or recessed our traffic (perhaps people have more free time and spend more time browsing, quite apart from the fact that this is a rather good site!). So keep reading The Horse's Mouth and you'll see the story continue to unfold.
If you have found this article useful, please remember that we can help you with issues like this in relation to your own servers. We offer Linux / Unix Web Server courses and also a variety of PHP training and a MySQL course too. But in addition / as a starter, please feel free to ask! A day of help or advice may pay for itself a hundred times over - even if I can't come up with a complete solution, I can certainly give pointers and help look at your own, individual case. The easiest way to contact me is via this form and I'll be back to you within 24 hours.