Checking robots.txt from Python
Archive - Originally posted on "The Horse's Mouth" - 2009-07-12 14:00:57 - Graham Ellis
The robots.txt file - which well behaved automata check to see whether they are welcome on a web site - has two directives in its base specification: User-Agent and Disallow. You will find some other directives used, and you will find some sites whose robots.txt files have blank lines after the User-Agent line, even though (in the specification) the block for a user agent ends at a blank line. These rules, and web masters' lack of knowledge of the detail, mean that some sites don't have their robots exclusion file as effective as they would wish.
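By way of illustration (this is a made-up sample, not taken from any particular site), a minimal robots.txt might look like the following; note how the blank line ends the block for the first user agent, so anything after it belongs to a new block:

   User-Agent: *
   Disallow: /private/
   Disallow: /tmp/

   User-Agent: somebot
   Disallow: /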
I have written a very short Python example here which reads a robots.txt file via the HTTP protocol and analyses it to report on the active User-Agent and Disallow lines - not only as a sample program on today's Python Course, but also to allow me to do a quick sanity check of robots.txt files.
Features of this Python example include ...
• Checking the number of command line parameters
• Connecting to a remote web resource and reading it as if it were a file
• Use of exceptions
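The original example file is not reproduced here, but a minimal sketch along the same lines might look like this (written for Python 3's urllib.request; the 2009 original would have used the urllib module of the day, and the exact variable names below are my own):

   import sys
   import urllib.request
   import urllib.error

   # Check the number of command line parameters
   if len(sys.argv) != 2:
       print("Usage: %s http://www.example.com" % sys.argv[0])
       sys.exit(1)

   url = sys.argv[1].rstrip("/") + "/robots.txt"

   try:
       # Connect to the remote web resource and read it as if it were a file
       with urllib.request.urlopen(url) as fh:
           lines = fh.read().decode("utf-8", "replace").splitlines()
   except urllib.error.URLError as err:
       print("Could not read %s (%s)" % (url, err))
       sys.exit(2)

   # Report the active User-Agent and Disallow lines; a blank line
   # marks the end of the block for the current user agent
   for line in lines:
       text = line.split("#")[0].strip()
       if not text:
           print("--- end of block ---")
           continue
       field = text.split(":", 1)[0].strip().lower()
       if field in ("user-agent", "disallow"):
           print(text)

Run it with a single site name on the command line (for example "python robocheck.py http://www.wellho.net") and it prints just the directive lines that a robot would act on, which makes a quick sanity check of the file very easy.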