Scraping content for your own page via PHP
Archive - Originally posted on "The Horse's Mouth" - 2009-12-21 07:14:31 - Graham Ellis

If your PHP allows remote URLs to be handled / read as if they were files (and that's the default), you have a useful tool which lets you include the content of one web page (or part of it) within another. For example, I can "scrape" the sections of a "coming on a course" page and insert them into another page.
Here's an example of the mechanism in use ...
1. Grab the page to be scraped:
$lyne = file_get_contents("http://www.wellho.co.uk/net/join.html");
2. Extract the data you want from it:
$includedtext = "";
preg_match_all("!<dt>(.+?)</dt>.*?<dd>(.+?)</dd>!s", $lyne, $here);
for ($k = 0; $k < count($here[0]); $k++) {
    $includedtext .= "<b>" . htmlspecialchars(strip_tags($here[1][$k])) .
                     "</b><br />" . htmlspecialchars(strip_tags($here[2][$k])) .
                     "<br /><br />";
}
3. Use the $includedtext within your code
You can try this out [here] and see the source code [here]
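To check what the regular expression in step 2 captures before pointing it at a live page, you can run the same extraction against an inline sample of &lt;dt&gt;/&lt;dd&gt; markup (the sample content below is invented for illustration):

```php
<?php
// Same extraction as step 2, but against a local sample string instead of
// a fetched page, so the behaviour of the pattern can be seen directly.
$lyne = "<dl><dt>Course A</dt><dd>Starts <b>Monday</b></dd>"
      . "<dt>Course B</dt><dd>Starts Tuesday</dd></dl>";

$includedtext = "";
preg_match_all("!<dt>(.+?)</dt>.*?<dd>(.+?)</dd>!s", $lyne, $here);
for ($k = 0; $k < count($here[0]); $k++) {
    $includedtext .= "<b>" . htmlspecialchars(strip_tags($here[1][$k])) .
                     "</b><br />" . htmlspecialchars(strip_tags($here[2][$k])) .
                     "<br /><br />";
}
echo $includedtext;
```

Each definition term comes out as a bold heading followed by its tag-stripped description, ready to drop into the surrounding page.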
This example comes with a string of cautions ...
1. Do NOT allow just any old URL to be scraped, especially one that your users may enter. That leaves you open to having your content filled with their adverts!
2. If you are scraping the same page regularly and it doesn't change very much, you should cache the results rather than fetching the page afresh on every request.
3. Respect the robots exclusion standard (robots.txt) of the remote site that you're scraping, and ensure that you have copyright permission to reproduce the material on your site too.
4. Remember that if the remote site's format changes so that your regular expression no longer matches, you'll have a correction to make on your site PDQ!
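On the first caution: one way to restrict scraping is to check the URL's host against a fixed list before fetching. This is a sketch of my own, not code from the original article; the function name and the allowed-host list are invented:

```php
<?php
// Only fetch from hosts we explicitly trust (illustrative list).
function urlAllowed($url, $allowedHosts) {
    $host = parse_url($url, PHP_URL_HOST);   // extract the host part of the URL
    return $host !== false && $host !== null &&
           in_array($host, $allowedHosts, true);
}

$allowedHosts = array("www.wellho.co.uk");
if (urlAllowed("http://www.wellho.co.uk/net/join.html", $allowedHosts)) {
    // safe to go ahead and call file_get_contents(...) here
}
```

Checking the parsed host (rather than doing a substring match on the whole URL) avoids being fooled by addresses like http://www.wellho.co.uk.evil.example/.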
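On the second caution: here is a minimal sketch of file-based caching, assuming a writable cache file and a chosen lifetime (the file name, the one-hour lifetime and the function name are all invented for illustration):

```php
<?php
// Re-fetch the page only when the cached copy is older than $maxAge seconds.
function cachedFetch($url, $cacheFile, $maxAge, $fetch) {
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);   // fresh enough - use the cache
    }
    $page = call_user_func($fetch, $url);       // fetch and refresh the cache
    file_put_contents($cacheFile, $page);
    return $page;
}

// e.g. $lyne = cachedFetch("http://www.wellho.co.uk/net/join.html",
//                          "/tmp/join.cache", 3600, 'file_get_contents');
```

Passing the fetcher in as a callable keeps the caching logic testable without touching the network.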
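On the third caution: a rough sketch of checking a path against the "User-agent: *" rules in a robots.txt file. This is a deliberately simplified parser for illustration only, not a complete implementation of the robots exclusion standard:

```php
<?php
// Returns true if $path is matched by a Disallow rule in the
// "User-agent: *" section of the supplied robots.txt text.
function disallowedByRobots($robotsTxt, $path) {
    $applies = false;
    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));   // strip comments
        if ($line === '') continue;
        if (preg_match('/^User-agent:\s*(.*)$/i', $line, $m)) {
            $applies = ($m[1] === '*');                   // entering a section
        } elseif ($applies && preg_match('/^Disallow:\s*(.*)$/i', $line, $m)) {
            $rule = $m[1];
            if ($rule !== '' && strpos($path, $rule) === 0) return true;
        }
    }
    return false;
}
```

A check like this, run against the remote site's /robots.txt before each scrape (or each cache refresh), keeps you on the right side of the site's stated wishes.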
We currently have examples of the use of scraped material on the Melksham Chamber of Commerce home page and also the First Great Western Coffee Shop. Take the power of this facility ... but be careful how you use it!