Do not re-invent the wheel - use a Perl module
Archive - Originally posted on "The Horse's Mouth" - 2009-06-11 18:44:13 - Graham Ellis
"If you think 'surely someone has done this before', you're probably right ... and in Perl, you'll find the resource you need available as a module on your system or, if it's not quite so common, on the CPAN". I was reminded of this advice today, when I got involved with web site checking ... and rather than writing my own robotic browser in Perl, I used the LWP module (the "Library for WWW in Perl", in case you wondered!)
What can I do with LWP? Well - I have several new examples to show you.
Reporting all the internal and external links from a page - this uses LWP::Simple, standard on my Perl and easy to use
A short example that grabs a page and echoes its content and status, using a minimal series of calls to the more complete LWP module
A script that grabs a web page, then checks all the links from it - a prototype example which needs some more work, but it's already found a broken link to an external site from one of our pages - and such things are very time-consuming to monitor by hand!
Here's an example of the sort of output you can get from that last program:
Dorothy-2:perl grahamellis$ perl goodlinks http://www.wellhousemanor.co.uk/
Status from http://www.wellhousemanor.co.uk/whm.css is 200
Status from https://lightning.he.net/~wellho/hotel/reservation.php is 500
Status from http://www.wellhousemanor.co.uk/rooms.html is 200
Status from http://www.wellho.net/happens/rooms.php is 200
Status from http://www.wellhousemanor.co.uk/amenities.html is 200
Status from http://www.wellhousemanor.co.uk/events.html is 200
Status from http://www.wellhousemanor.co.uk/contact.html is 200
Status from http://www.westwiltshire.gov.uk/index/env/env-health-service/food-hygiene/scores-on-doors.htm is 404
Status from http://www.wellho.net is 200
Status from http://www.wiltshirebusinessoftheyear.co.uk/ is 200
Status from http://www.aguafabrics.com/default.asp is 200
Status from http://www.hoteldesigns.net/industrynews/news_2745.html is 200
Status from http://www.macformat.co.uk is 200
Status from http://www.wellhousemanor.co.uk/art.html is 200
Status from http://www.tripadvisor.co.uk/ is 200
Status from http://www.tripadvisor.co.uk/Hotel_Review-g528775-d645951-Reviews-Well_House_Manor-Melksham_Wiltshire_England.html is 200
Status from http://www.freeindex.co.uk/profile(Well-House-Consultants-Ltd)_44477.htm is 200
Status from http://validator.w3.org/check is 200
Dorothy-2:perl grahamellis$
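For readers who want to try something similar, here is a minimal sketch of a link checker along the same lines. It is not the goodlinks script itself, whose source isn't shown here. The helper names (extract_links, is_internal, link_status), the agent string, the 15 second timeout and the "same host means internal" rule are all assumptions made for illustration. It relies on LWP::UserAgent, HTML::LinkExtor and URI, which normally arrive together with libwww-perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Return the unique absolute http(s) URLs referenced by $html (href, src and
# similar attributes), resolving relative links against $base.
sub extract_links {
    my ($html, $base) = @_;
    my (@links, %seen);
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        for my $link (sort values %attr) {
            my $uri = URI->new_abs($link, $base);
            # Skip mailto:, javascript: and anything else that isn't a web link
            next unless $uri->scheme && $uri->scheme =~ /^https?$/;
            $uri->fragment(undef);    # page.html#top is the same page as page.html
            my $abs = $uri->as_string;
            push @links, $abs unless $seen{$abs}++;
        }
    });
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Assumption: a link counts as "internal" when it is on the same host as the
# starting page.
sub is_internal {
    my ($url, $start) = @_;
    return lc(URI->new($url)->host) eq lc(URI->new($start)->host);
}

# Return the numeric HTTP status for $url. HEAD is tried first because it is
# cheap; some servers refuse it, so fall back to GET before reporting a failure.
sub link_status {
    my ($ua, $url) = @_;
    my $response = $ua->head($url);
    $response = $ua->get($url) unless $response->is_success;
    return $response->code;
}

# Only touch the network when a starting URL is given on the command line.
if (@ARGV) {
    my $start = shift @ARGV;
    my $ua = LWP::UserAgent->new(timeout => 15, agent => 'linkcheck-sketch/0.1');
    my $page = $ua->get($start);
    die "Cannot fetch $start: ", $page->status_line, "\n" unless $page->is_success;
    for my $url (extract_links($page->decoded_content // '', $page->base)) {
        my $kind = is_internal($url, $start) ? 'internal' : 'external';
        print "Status from $url ($kind) is ", link_status($ua, $url), "\n";
    }
}
```

Saved as, say, linkcheck (a made-up name), it would be run as perl linkcheck followed by the URL of the page to check. Keeping the link extraction separate from the fetching means the parsing can be tried out on a string of HTML without going anywhere near the network.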