Matching disparate referencing systems (MediaWiki, PHP, also Tcl)

Archive - Originally posted on "The Horse's Mouth" - 2009-05-19 09:08:12 - Graham Ellis

Yes, we are Well House CONSULTANTS and do a bit of specialist coding ...

I have a requirement on my plate at present to write a piece of code for a customer that recognises cross reference codes within a document and turns them into links. What makes the task quite difficult is that the references come from all sorts of different original sources, in varied formats, some of which might even be identifiable in two ways.

We'll be using a regular expression based identification system, but how do we make such a scheme logical, easy to follow, and easy to maintain in the future as new references and exceptions to the general rules get added? Well, to start with, I'll be using the bunching technique I described last week to make individual regular expressions easier to read, and to avoid the need to keep repeating the same bundles of special characters. But there will be more to it ...
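As a rough illustration of that bunching idea (the earlier post isn't reproduced here, so this is my guess at its flavour rather than the actual code), here is a minimal Python sketch. The fragment names, the `bunch` helper and both reference formats are made up for the example; the customer's real formats aren't given above.

```python
import re

# Hypothetical building blocks: each name stands for a bundle of special
# characters that would otherwise be repeated in every pattern.
FRAGMENTS = {
    "year": r"(?:19|20)\d{2}",
    "num": r"\d{1,4}",
    "vol": r"[A-Z]{2,4}",
}

def bunch(template):
    """Expand <name> placeholders in a template into their regex fragments.

    Note: in this sketch a template must not use (?P<name>...) groups,
    because <name> would be mistaken for a fragment placeholder.
    """
    expanded = re.sub(r"<(\w+)>", lambda m: FRAGMENTS[m.group(1)], template)
    return re.compile(expanded)

# Two invented reference styles, written in terms of the shared fragments.
LAW_REPORT = bunch(r"\[<year>\] <vol> <num>")
CASE_NUMBER = bunch(r"\b<vol>/<num>/<year>\b")

if __name__ == "__main__":
    print(LAW_REPORT.search("see [2009] QB 123 for details").group(0))
    print(CASE_NUMBER.search("filed as AB/12/2009 last week").group(0))
```

If the definition of a year or a volume code changes, it changes in one place, and each pattern still reads more or less like the reference it describes.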

Spring, Summer, Autumn

Most of the cross reference codes will conform to a pattern, or a series of patterns, which can be identified fairly easily. I'll describe these as "summer" expressions, as that's the time of year that most people go on holiday, that places are crowded, and there's a maximum of facilities available for them.

For those who don't manage to catch the summer, there are autumn holidays - fewer people around, and special cases for those who have missed out on the summer. I'm going to describe a series of autumn matches for those references which have been missed by the main filters.

Some of the URLs that form special references include an embedded main (summer) reference in them ... so handling them can't wait until the autumn. For this reason, we'll also provide early-bird spring holidays (or regular expressions) to ensure that it's the proper complete reference that's handled, rather than the embedded mainstream one.

And finally ... I understand there are special cases. We'll call those "snowdrops" - we'll allow them to be individually marked up within documents by the document provider, and they'll be extracted / handled ahead of spring.
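To make the ordering concrete, here is a minimal Python sketch of the four passes run in sequence: snowdrops, then spring, then summer, then autumn. Everything specific in it is an assumption for illustration only: the `{{ref|url|label}}` snowdrop markup, the `AB-1234` style summer code, the "doc no." autumn form and the example.org URLs are all invented. Each match is swapped for a placeholder as soon as it is found, so a later season can't match inside something an earlier season has already claimed.

```python
import re

# Each stage: (season, pattern, function returning (url, label) for a match).
# All patterns and URLs are hypothetical.
STAGES = [
    # Snowdrops: references marked up by hand by the document provider.
    ("snowdrop", re.compile(r"\{\{ref\|([^|}]+)\|([^}]+)\}\}"),
     lambda m: (m.group(1), m.group(2))),
    # Spring: complete URLs that contain a summer-style code.
    ("spring", re.compile(r"https?://[\w./-]*[A-Z]{2}-\d{4}[\w./-]*"),
     lambda m: (m.group(0), m.group(0))),
    # Summer: the mainstream pattern that most references follow.
    ("summer", re.compile(r"\b[A-Z]{2}-\d{4}\b"),
     lambda m: ("https://example.org/ref/" + m.group(0), m.group(0))),
    # Autumn: stragglers missed by the main filters.
    ("autumn", re.compile(r"\bdoc(?:ument)? no\. ?(\d+)\b", re.IGNORECASE),
     lambda m: ("https://example.org/doc/" + m.group(1), m.group(0))),
]

def link_references(text):
    """Turn recognised references into links, earliest season first."""
    held = []  # finished links, parked until every stage has run

    def make_sub(build):
        def sub(m):
            url, label = build(m)
            held.append('<a href="%s">%s</a>' % (url, label))
            # NUL-delimited placeholder that no later pattern can match.
            return "\x00%d\x00" % (len(held) - 1)
        return sub

    for _season, pattern, build in STAGES:
        text = pattern.sub(make_sub(build), text)

    # Put the parked links back in place of their placeholders.
    return re.sub(r"\x00(\d+)\x00", lambda m: held[int(m.group(1))], text)

if __name__ == "__main__":
    sample = ("See https://example.com/archive/AB-1234/page2 and AB-5678, "
              "also doc no. 42 and {{ref|https://example.net/x|odd one}}")
    print(link_references(sample))
```

Adding a new reference format then means adding one entry to the right season, and the order of the list is the order of precedence.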

A new idea? No - there's nothing much new in this world ... you'll see a similar concept used within Expect (the Tcl extension), with the expect, expect_before and expect_after commands. "Look out for xxx, failing that yyy, failing that zzz". Tcl may be mature but it's still an inspiration!