Cooking bodies and URLs

Archive - Originally posted on "The Horse's Mouth" - 2009-01-08 02:54:58 - Graham Ellis

With over ten thousand different web pages on our web site, finding the right resource has become just as big an issue as having the right material available in the first place. Listings by article type and number (example) are great for crawlers / bots, and for staff checking page by page. Division of our material into modules (module list) helps somewhat, but it still leaves people working through lists. Past students will do that with some determination, because they know they have a good chance of finding what they need, but the casual visitor (and potential trainee ;-) ) is missed completely. That's where a search capability comes in - we've had one for a while, but not everyone says "I think I'll do a site search", so we want to automate that search and add a few results to many / most pages. You're probably very familiar with the sort of thing:

But how do I decide what to say in such a small area? How do I include the meat but trim out the fat? I've used regular expressions - and here (coded in PHP) are the specifics of what I have done for this example:

For the body, remove all the markup from the content block that we have stored in a database and truncate it to just the first 150 characters, plus a few extra characters so that it isn't broken in the middle of a word, then plonk a "..." on the end to show that it's only the start of the body.


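// Strip the markup from the stored content block
$body = strip_tags($row['body']);
// Keep 150 characters, run on to the end of the current word, then add " ..."
$body = preg_replace('/^(.{150}\S*\s)(.*)$/s','\1 ...',
  $body);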
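
If you want to try the body trimming away from a database, here is a minimal sketch that wraps the same two steps in a function. The function name summarise_body, the $limit parameter and the sample strings are assumptions made for illustration; they are not part of the site's code.

```php
<?php
// Hypothetical helper: same two steps as above, with the 150 made a parameter.
function summarise_body(string $html, int $limit = 150): string
{
    // Remove all markup, leaving plain text.
    $text = strip_tags($html);
    // Keep $limit characters, run on to the next whitespace, then add " ...".
    // If the pattern does not match (short text), the text comes back unchanged.
    return preg_replace('/^(.{' . $limit . '}\S*\s)(.*)$/s', '\1 ...', $text);
}

// Made-up sample content, 200 characters long once the tags are stripped.
$long = '<p>' . str_repeat('word ', 40) . '</p>';
echo summarise_body($long), "\n";

// Text shorter than the limit is left alone.
echo summarise_body('<b>Short</b> text'), "\n";   // Short text
```

Two things to watch if you adapt this: without the /u modifier the 150 is counted in bytes, so text with multi-byte characters will be cut a little earlier, and a body with no whitespace at all after the 150th character will not match the pattern and so will not be shortened.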


For the URL, take any section of 18 or more characters between successive dots and / or slashes, and replace it with its first 8 characters, " ... ", and its last 5 characters. These days, URLs are semi-descriptive, often comprising the title of the article with dashes or underscores, and such long URLs give some browsers folding problems. But the user DOES want to see the end of the URL, to know whether it's a ".html" or a ".php" that they'll be linking on to. Here's the URL code:

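// Build the full URL from the stored path
$uddd = "http://www.wellho.net$row[url]";
// Groups: delimiter, first 8 chars, middle (5 or more), last 5 chars, delimiter
$uddd = preg_replace(
  '!([/\.])([^/\.]{8})([^/\.]{5,})([^/\.]{5})([/\.])!',
  '\1\2 ... \4\5',$uddd);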
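
To see what that pattern does to a real-looking address, here is a small sketch using the same regular expression inside a function. The name shorten_url and the sample paths are hypothetical, chosen only to show one long section and one URL with nothing to shorten.

```php
<?php
// Hypothetical wrapper around the URL pattern shown above.
function shorten_url(string $url): string
{
    // A section of 18+ characters between dots / slashes becomes
    // "first 8 ... last 5"; the delimiters on each side are kept.
    return preg_replace(
        '!([/\.])([^/\.]{8})([^/\.]{5,})([^/\.]{5})([/\.])!',
        '\1\2 ... \4\5',
        $url
    );
}

// Made-up example path with one 29 character section.
echo shorten_url('http://www.wellho.net/resources/a-very-long-descriptive-title.html'), "\n";
// http://www.wellho.net/resources/a-very-l ... title.html

// No section reaches 18 characters, so nothing changes.
echo shorten_url('http://www.wellho.net/short/page.php'), "\n";
// http://www.wellho.net/short/page.php
```

One limitation worth knowing about: because the closing delimiter is consumed by the match, two long sections directly next to each other share a delimiter, and only the first of them gets shortened. A section at the very end of the URL with no dot or slash after it is also left as it is.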


These are not the world's simplest regular expressions (far from it!), yet they do show just how much can be done in a single statement. We cover such techniques with PHP specifically in mind on our PHP Techniques Workshop, and in more depth (and regular expressions more generally) on our Regular Expressions day.

You can see more results from these algorithms already in use on our resources pages (example), and in time many (most? almost all?) pages on our site will have an improved and consistent 'see also' along these lines. Key features include:

Automated - we don't have to go through and do all the work of adding in extra links on every page - just provide some categorisation hints.

Adaptive - we're recording hit / visit counts, so that we can promote popular pages higher up the listings.

Consistent across slightly varied page types. The display should be morally identical no matter what the resource type is. Frankly, you don't care whether the answer to the question "how do I tell Google which country we operate in" is in a forum post, a longer article, or a blog entry written in May 2007 - you just want to find the f***ing answer!
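
As an illustration of the "adaptive" idea, here is one way popular pages could be promoted once hit counts are available. This is only a sketch under stated assumptions: the field names 'url' and 'hits', the function promote_popular and the sample rows are all invented for the example, and the real listings may well be ordered differently (for instance in the database query itself).

```php
<?php
// Hypothetical candidate rows; 'url' and 'hits' are assumed field names.
$candidates = [
    ['url' => '/resources/a.html', 'hits' => 12],
    ['url' => '/resources/b.html', 'hits' => 340],
    ['url' => '/resources/c.html', 'hits' => 57],
];

// Sort by hit count, highest first, and keep the top few.
function promote_popular(array $rows, int $keep = 3): array
{
    usort($rows, fn(array $a, array $b): int => $b['hits'] <=> $a['hits']);
    return array_slice($rows, 0, $keep);
}

foreach (promote_popular($candidates) as $row) {
    echo $row['hits'], ' ', $row['url'], "\n";
}
// 340 /resources/b.html
// 57 /resources/c.html
// 12 /resources/a.html
```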


There are some other "salesy" things I could add too: fast, as it happens on the fly; expandable, as we have the basis for it to grow from 10,000 different URLs to 100,000 very easily; marketable - maybe. I'm certainly happy to tell you how we do things like this, and to sell you my time as I tell you and help you understand it in depth. That would be called a private training course.