Static mirroring through HTTrack, wget and others

Archive - Originally posted on "The Horse's Mouth" - 2009-03-03 03:20:11 - Graham Ellis

Our web site is not best suited to off-line browsing these days. It may be flexible, but if you want to take a copy of it, put it onto a CD, then browse away from the Internet, please resist the temptation. Why is it NOT a good idea to 'blind mirror' us?

1. The changing nature of our web site. Our pages are adaptive; if you browse from Aberdeen, you'll be offered pulldown menus asking if you're in Aberdeen, Inverness, Dundee or Perth, but if you're browsing from Bristol, you'll be offered Bristol, Bath, Newport or Taunton. If you're browsing with Internet Explorer, some adaptation of the HTML will be made to accommodate non-standard features. Your previous visit history will be noted, and you'll have different options highlighted as our page is presented in a way that helps you navigate. None of these features can work from a mirror CD!

2. Our size. We've got around 15,000 different URLs on this web site ... pages ranging from pictures of Gosport Station to using Utility methods to construct objects of different type in Python, and it's unlikely that you'll want them all - so mirroring is a very slow and very blunt tool which hurts ...

3. Our bandwidth. It's a serious resource hog if you try to copy all of our pages. You're costing us a lot of bandwidth, you're slowing down others who are trying to use our site - basically, you're being antisocial (though probably not intentionally so!). And do you know the worst of it ...

4. Out of date. Your mirror copy will rapidly go out of date, as this is a dynamic site where new examples are added, links updated, and comments amended somewhere all the time. Having spent a lot of time creating a traffic jam, you'll find that the destination really wasn't worth going to.

5. Copyright issues. I am also concerned about copyright; I appreciate that duplicating content is easy, but I would much rather provide a feed to people as they need pages than have - as I have found in the past - mirrored pages that have out-of-date or unaltered absolute links, and are said to be in our name. Imitation ought to be the sincerest form of flattery, but copies like these just get us a bad reputation.

If you're thinking of mirroring us ... please don't do it ... and if you have found this page unexpectedly ... our web site probably thinks that you are trying to mirror it, and is asking you not to do so!

How do we detect mirroring operations?

There are certain programs that do it, and we look for names like wget and HTTrack in the User-Agent strings of incoming requests and in our logs. Such signals aren't going to find the people who try to hide what they're doing, but we have other flags that may find them. This is something we discuss on courses such as Linux Web Server, which helps you with your httpd deployment.
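As a rough illustration of the User-Agent check described above, here is a minimal Python sketch. It is not the code running on this site. It assumes log lines in Apache's "combined" format, where the User-Agent is the last quoted field; the function names and the marker list are hypothetical, chosen for this example.

```python
import re

# Lower-case substrings to look for in the User-Agent string.
# "wget" and "httrack" are the tools named in the text; the tuple
# itself is an assumption of this sketch and can be extended.
MIRROR_AGENT_MARKERS = ("wget", "httrack")

# In Apache's "combined" log format the User-Agent is the final
# double-quoted field on the line.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def user_agent_from_log_line(line):
    """Return the User-Agent from a combined-format log line, or None."""
    match = _LAST_QUOTED_FIELD.search(line)
    return match.group(1) if match else None


def looks_like_mirror_tool(user_agent):
    """True if the User-Agent names one of the known mirroring tools."""
    if not user_agent:
        return False
    lowered = user_agent.lower()
    return any(marker in lowered for marker in MIRROR_AGENT_MARKERS)


if __name__ == "__main__":
    sample = (
        '203.0.113.7 - - [03/Mar/2009:03:20:11 +0000] '
        '"GET /index.html HTTP/1.1" 200 5120 "-" "Wget/1.11.4"'
    )
    agent = user_agent_from_log_line(sample)
    print(agent, looks_like_mirror_tool(agent))
```

A check like this only catches tools that announce themselves honestly; anyone who changes the User-Agent string will pass straight through it, which is why other flags are needed too.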

How should you as a webmaster handle such bulk download requests?

First things first - work out what you want to do. Do you want to allow mirrors, allow part of the site to be mirrored, rudely lock and bolt the front door against mirroring, or hang up a polite sign that says 'please do not mirror'? And if you go for the last of these, how do you get your mirrorers to actually read the sign?

If you've decided to restrict your users from mirroring, have a look at robots.txt, and have a look too at the environment variables that are set from the user agent string (such as HTTP_USER_AGENT) and their use in conjunction with either Deny directives or RewriteCond directives. And if you have common include files, you can put some database recording and monitoring in there to pick up the unusual traffic flows that are characteristic of mirroring attempts on larger sites ... all of which make very long subjects for a blog, but make for excellent lunchtime discussions on a PHP techniques Workshop!
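To give a flavour of the traffic monitoring idea, here is a small Python sketch of a sliding-window request counter. It is an illustration only: the class name, the thresholds and the in-memory storage are all assumptions made for this example, whereas a real site would more likely record hits in a database from its common include file, as described above.

```python
from collections import defaultdict, deque


class RequestRateMonitor:
    """Flag clients that make too many requests within a time window.

    Hypothetical helper for illustration; the default thresholds are
    arbitrary and would need tuning against real traffic.
    """

    def __init__(self, max_requests=60, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # One queue of request timestamps per client (e.g. per IP address).
        self._hits = defaultdict(deque)

    def record(self, client_id, timestamp):
        """Record one request; return True if the client exceeds the limit."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard timestamps that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    monitor = RequestRateMonitor(max_requests=3, window_seconds=10.0)
    for second in range(5):
        flagged = monitor.record("203.0.113.7", float(second))
        print(second, flagged)
```

An ordinary visitor reading a few pages stays well under any sensible limit, while a mirroring tool walking thousands of URLs trips it quickly. What you do next (serve the polite sign, slow the responses, or refuse) is the policy decision discussed above.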