Helping search engines with appropriate 400 error codes

Archive - Originally posted on "The Horse's Mouth" - 2013-02-11 17:48:52 - Graham Ellis

Over the years, our web site has done its best to help people who enter URLs with slight errors in them (misspellings, wrong directory names, etc.), and we have quite a number of scripts that take parameters which can be modified to give different results. For example:
  http://www.wellho.net/net/quote.html?where=L10+1RF
will get you a quotation for a course in Liverpool and
  http://www.wellho.net/net/quote.html?where=17013
will get you a quote for a course in Carlisle, Pennsylvania!

That's very flexible, but it means that a wide range of requests can be made of our servers, which in turn can result in the search engines and others finding a lot of pages that really aren't wanted and shouldn't be indexed. It just takes one person putting a link somewhere to set off a whole chain, or one malicious spider to start propagating valid but unwanted URLs.

I've spent some time this weekend cleaning up - making sure that we signal rather better to the search engines (and to other visitors too) which URLs are inappropriate. Firstly, some extras in our robots.txt file:
  Disallow: /net/recents.htm
  Disallow: /resources/recents.htm
That asks any well-behaved crawler (most are!) not to go to either of these URLs, or to any URL that starts with that text. So as well as the .html files (which were previously disallowed), we now request that the .htm files are not indexed.
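The "starts with that text" behaviour can be checked offline. What follows is a minimal sketch in Python (not something we run on the site), using the standard library's urllib.robotparser. The "User-agent: *" line and the crawler name "ExampleBot" are assumptions for illustration, as only the two Disallow lines are shown above.

```python
from urllib.robotparser import RobotFileParser

# The two Disallow lines from the post. The "User-agent: *" line is an
# assumption, because the rest of the robots.txt file is not shown.
ROBOTS_TXT = """\
User-agent: *
Disallow: /net/recents.htm
Disallow: /resources/recents.htm
"""


def build_parser(text):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, path, agent="ExampleBot"):
    """Return True if a crawler called `agent` (hypothetical) may fetch `path`."""
    return parser.can_fetch(agent, path)


parser = build_parser(ROBOTS_TXT)
for path in (
    "/net/recents.htm",
    "/net/recents.html",
    "/net/recents.html/extra",
    "/net/quote.html",
):
    print(path, "allowed" if is_allowed(parser, path) else "disallowed")
```

Because each Disallow line is treated as a prefix, the single /net/recents.htm entry also covers /net/recents.html and anything longer that begins with the same text.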

And here are some of the error codes that we now generate ...

400 - Your request is a bad one. Most probably you have tried to call up a script with an inappropriate parameter, such as giving us a URL when we expect a number or a simple directory name. (Is this an injection attack going on?)

403 - Forbidden. You are not allowed access to this. We may have identified you as some sort of automatic process trying to mirror our site, and that's not appropriate as this is a dynamic site and the output changes.

404 - Not found. Most likely a typo in the URL or a missing page on our site. Possibly an automated probe to see if a page exists. If you get this from a link, please let us know!

410 - Gone. The resource you're looking for does not exist, and is never likely to exist. Please clear your cache of it and don't ask again. When an automated link has accidentally (or maliciously) been generated and points to a whole family of resources, we'll tell you to clear them out via a 410.
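To make the distinction between the four codes concrete, here is a rough sketch in Python of how a script might choose between them. It is an illustration only, not our actual logic: the function name status_for, the "simple value" pattern, and the known_pages / retired_pages sets are all hypothetical.

```python
import re
from http import HTTPStatus

# Hypothetical rule: an acceptable parameter is a number or a simple name,
# so anything containing slashes, colons, dots etc. is treated as a bad request.
SIMPLE_VALUE = re.compile(r"[A-Za-z0-9_-]+")


def status_for(value, known_pages, retired_pages, blocked_client=False):
    """Pick a status code for a request parameter (illustrative only)."""
    if blocked_client:
        # e.g. an automated process identified as mirroring the site
        return HTTPStatus.FORBIDDEN
    if not SIMPLE_VALUE.fullmatch(value):
        # e.g. a URL passed where a number or directory name was expected
        return HTTPStatus.BAD_REQUEST
    if value in retired_pages:
        # the resource is deliberately and permanently absent
        return HTTPStatus.GONE
    if value not in known_pages:
        # probably a typo or a missing page
        return HTTPStatus.NOT_FOUND
    return HTTPStatus.OK


known = {"17013", "liverpool"}    # made-up example values
retired = {"oldcourse"}           # made-up example value
for value in ("17013", "http://example.com/x", "oldcourse", "nosuchpage"):
    status = status_for(value, known, retired)
    print(value, int(status), status.phrase)
```

The order of the checks matters in this sketch: a malformed value is rejected with a 400 before any lookup takes place, so it never reaches the 404 or 410 branches.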

Some of these are generated via lines in the .htaccess files:

  # Any crap on the end of "recents" URLs - kill the page!
  RewriteRule ^recents\.html. /err/tiny410.html [R=410,L]
  RewriteRule ^recents\.htm/. /err/tiny410.html [R=410,L]

  # Get rid of query strings with long (injection?) paths
  # (the RewriteCond applies only to the first RewriteRule below it)
  RewriteCond %{QUERY_STRING} /.*/.*/
  RewriteRule ^smap /err/tiny400.html [R=400,L]
  RewriteRule ^smap.*%3 /err/tiny400.html [R=400,L]

  # If there's more than just a digit on the end, naughty
  RewriteCond %{QUERY_STRING} full=..
  RewriteRule ^landings\.htm /err/tiny403.html [R=403,L]
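The patterns above are easy to get subtly wrong, so it can help to try them against sample URLs before relying on them. This is a small Python sketch that applies the same regular expressions with the re module. It is an approximation under stated assumptions: the path is given as Apache would see it in a per-directory .htaccess (no leading directory or slash), URL decoding is not modelled, and the file names smap.html and landings.html used in the examples are hypothetical.

```python
import re


def rewrite_status(path, query=""):
    """Return the status the rules above would appear to send, or None.

    `path` is relative to the directory holding the .htaccess file and
    `query` is the query string without the leading "?".
    """
    # Anything extra on the end of a "recents" URL -> 410
    if re.search(r"^recents\.html.", path) or re.search(r"^recents\.htm/.", path):
        return 410
    # Query string containing at least three slashes -> 400
    # (this condition guards only the first "smap" rule)
    if re.search(r"^smap", path) and re.search(r"/.*/.*/", query):
        return 400
    # "smap" URL whose path contains a literal "%3" -> 400
    if re.search(r"^smap.*%3", path):
        return 400
    # "full=" followed by two or more characters -> 403
    if re.search(r"^landings\.htm", path) and re.search(r"full=..", query):
        return 403
    return None


samples = [
    ("recents.html", ""),
    ("recents.html/junk", ""),
    ("smap.html", "a=/x/y/z"),
    ("landings.html", "full=1"),
    ("landings.html", "full=12"),
]
for path, query in samples:
    print(path, query, rewrite_status(path, query))
```

Note that in these patterns an unescaped dot matches any single character, which is why "recents.html" on its own is left alone while "recents.html/junk" is caught.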

And sometimes one of our PHP scripts itself sends out a different header:

  header("HTTP/1.1 404 Not Found");
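For comparison only, the same idea of a script setting its own status line looks like this in a Python WSGI application. This is an analogy and not code from our site; the names not_found_app and call_app are made up for the example, and call_app is just a tiny harness so the sketch can be run without a web server.

```python
from http import HTTPStatus


def not_found_app(environ, start_response):
    """Minimal WSGI app that always answers with a 404 status line."""
    status = HTTPStatus.NOT_FOUND
    body = b"Not found\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    # The status string plays the same role as PHP's header("HTTP/1.1 404 Not Found")
    start_response(f"{status.value} {status.phrase}", headers)
    return [body]


def call_app(app, path="/missing"):
    """Call a WSGI app directly and return (status line, body bytes)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], body


status_line, body = call_app(not_found_app)
print(status_line)
```

In both languages the point is the same: the status has to be set before any of the page body is sent.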

These techniques are covered on our PHP Techniques and Linux Web Server courses when appropriate to the delegate group, and also on private courses.