Finding words and word boundaries (MySQL, Perl, PHP)

Archive - Originally posted on "The Horse's Mouth" - 2008-08-03 10:59:46 - Graham Ellis
If you're searching for the word "mile", you probably don't want the page that tells you that Sally Smiled at Harry. But you may want to find a Milestone, even if it is within quotes.

Regular Expressions are your friends!

In Perl-style regular expressions (which also work in Python, and in PHP with the preg functions), the \b anchor (a 'zero-width assertion') matches at a word boundary. In other words, it lets you find positions in your text string where a non-word character is followed by a word character (a letter, digit or underscore), or where a word character is followed by a non-word character. It also matches at the very beginning or very end of the string if the string starts / ends with a word character.

Example: /\bmile/i matches - ignoring case - words starting with "mile".
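
As a quick check of that pattern, here is a minimal sketch in Python, whose re module accepts the same \b anchor. The sample strings and the name starts_with_mile are made up for illustration.

```python
import re

# Case-insensitive pattern for words that start with "mile".
starts_with_mile = re.compile(r"\bmile", re.IGNORECASE)

samples = [
    "Sally Smiled at Harry",       # "mile" sits inside a word: no match
    'He called it a "Milestone"',  # the quote mark is a non-word character
    "mile after mile",             # boundary at the very start of the string
]

for text in samples:
    print(f"{text!r}: {bool(starts_with_mile.search(text))}")
```

Only the first sample is rejected, because its "mile" follows the letter "S" rather than a boundary.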

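In MySQL regular expressions (used with REGEXP and RLIKE matches), you have two different anchors. [[:<:]] matches at the start of a word and [[:>:]] matches at the end of a word. Slightly longer / more complex, but probably a little quicker to run. Note that these markers belong to MySQL's older regular expression library; from MySQL 8.0.4, which uses the ICU library, they are not supported and \b is used instead.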

My personal suggestion - if you are searching - is to anchor the start of the search term to a word boundary, but not the end. That way, you find all the "es", "ed" and "ing" words - end, ended, ending - but it does not send you round the bend with lots of spurious hits.
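
A small sketch of that suggestion in Python, assuming the search term arrives as a plain string. The helper name start_anchored and the word list are hypothetical; re.escape is there so that any punctuation in the term is treated literally.

```python
import re

def start_anchored(term: str) -> re.Pattern:
    """Match `term` only where it begins a word; the word may carry on after it."""
    return re.compile(r"\b" + re.escape(term), re.IGNORECASE)

pattern = start_anchored("end")
words = ["end", "ended", "ending", "bend", "send", "spurious"]

# Keep only the words in which "end" starts at a word boundary.
hits = [w for w in words if pattern.search(w)]
print(hits)
```

Leaving the end of the term unanchored is what lets "ended" and "ending" through, while "bend" and "send" are dropped.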

Our blog archive at http://www.wellho.net/mouth/ adds a column of "related short articles" down the right hand side to help you navigate to similar subjects. Until yesterday, we were reporting similar articles based on the subject of the current blog having one of its words appear within the subject line of another article - done that way from the early days of the blog to get a good spread of links to extra articles. However, that list was getting long, so I updated the script to use a MySQL regular expression - and now we have a list that (in most cases) has been trimmed back to a manageable size, and has a heightened relevance. For example - from 17 further links on that first page down to 12.

In other words - the clause
where entry_title like '%$word%'
has been replaced by
where entry_title rlike '[[:<:]]$word'
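
To see the effect of that change without a MySQL server, here is a rough sketch using Python's built-in sqlite3 module. It is not the blog's own script: the table contents are invented, SQLite's REGEXP operator is supplied by a small Python function, and \b stands in for MySQL's [[:<:]] marker. The word is also passed as a bound parameter rather than interpolated into the SQL string.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
    return re.search(pattern, value, re.IGNORECASE) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE entries (entry_title TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?)",
    [("Weekend in Wiltshire",), ("Ending a loop early",), ("The end of the line",)],
)

word = "end"

# Old clause: the word may appear anywhere, even in the middle of another word.
like_hits = conn.execute(
    "SELECT entry_title FROM entries WHERE entry_title LIKE ? ORDER BY entry_title",
    (f"%{word}%",),
).fetchall()

# New clause: the word must start at a word boundary.
rlike_hits = conn.execute(
    "SELECT entry_title FROM entries WHERE entry_title REGEXP ? ORDER BY entry_title",
    (r"\b" + re.escape(word),),
).fetchall()

print(len(like_hits), [row[0] for row in rlike_hits])
```

The substring match returns all three titles; the word-start match drops "Weekend in Wiltshire".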

As an aside - we also eliminate a few common words from the page matching - here's the regular expression used on each word in a list from the title.

'^(the|from|and|you|our|why|they|via|that|can|all|'.
'use|your|big|how|etc|for|one|two|not|after|work|but|get|are)$'
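
A sketch of how such a filter might be applied, again in Python. The pattern is the one above; splitting the title on \w+ and ignoring case are assumptions of mine, and the function name significant_words is hypothetical.

```python
import re

# The common-word pattern from above; ^ and $ make it match whole words only.
COMMON_WORDS = re.compile(
    r"^(the|from|and|you|our|why|they|via|that|can|all|"
    r"use|your|big|how|etc|for|one|two|not|after|work|but|get|are)$",
    re.IGNORECASE,  # assumption: "The" should be dropped as well as "the"
)

def significant_words(title):
    """Split a title into words and drop the common ones."""
    words = re.findall(r"\w+", title)
    return [w for w in words if not COMMON_WORDS.match(w)]

print(significant_words("Finding words and word boundaries"))
```

Because the pattern is anchored at both ends, "and" is removed but a longer word such as "android" would be kept.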