
XML, HTML, XHTML and more

Archive - Originally posted on "The Horse's Mouth" - 2008-11-23 07:19:56 - Graham Ellis

HTML is a language ... but XML is a metalanguage. In other words, you can write something in HTML and have it (quite) well defined, whereas anything you write in XML needs another layer of definition to tell you what's valid and what isn't. XML is a set of over-arching rules within which you can define your own XML-compliant language ... or use one that someone else has already defined for you, such as RSS, SOAP, XSLT or XHTML.



It's been said - and it's usually the case - that if you define your data using HTML, then you're defining how it looks, whereas with XML you're defining what it is. For example:

HTML - says how it should appear:

<h1>Melksham Town Center</h1>
<ol><li>Woolworths, Boots, Peacocks and Iceland
<li>All the major banks
<li>Tourist Information Center and Post Office
<li>Bus to Bath, Devizes, Chippenham and Trowbridge
</ol>


XML - says what it is:

<place>Melksham Town Center
<facility><chains><item>Woolworths</item>
   <item>Boots</item> <item>Peacocks</item>
   <item>Iceland</item></chains>
<banks><item>All the major banks</item></banks>
<general><item>Tourist Information Center</item>
   <item>Post Office</item></general>
<bus><item>Bath</item> <item>Devizes</item>
   <item>Chippenham</item> <item>Trowbridge</item></bus>
</facility></place>


From that, you can tell how the HTML will be displayed, but you don't know how the XML will be used or displayed. There needs to be a tailored intermediate piece of software specifying that, and doing the work. You may come across:

SAX - The Simple API (Application Programming Interface) for XML (Extensible Markup Language)

Using SAX, a stream of XML data is passed through a process (scanned by a program), and the interesting bits that the program needs are collected as it goes. I've described this as pouring a lot of water through a sieve, and catching the bits that you want in the sieve. SAX is ideal for getting a few specific elements out of a very large flow of data, but it's a poor fit for reading XML in order to edit and re-save it.
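To make the sieve idea concrete, here's a minimal sketch using Python's standard xml.sax module. It assumes a well-formed copy of the Melksham example (every item closed), and the BusHandler class is a name made up for this illustration. It catches only the bus destinations and lets everything else flow past.

```python
import xml.sax

# A well-formed copy of the sample document (every <item> is closed).
XML = """<place>Melksham Town Center
<facility><chains><item>Woolworths</item>
   <item>Boots</item> <item>Peacocks</item>
   <item>Iceland</item></chains>
<banks><item>All the major banks</item></banks>
<general><item>Tourist Information Center</item>
   <item>Post Office</item></general>
<bus><item>Bath</item> <item>Devizes</item>
   <item>Chippenham</item> <item>Trowbridge</item></bus>
</facility></place>"""


class BusHandler(xml.sax.ContentHandler):
    """Hypothetical handler: keeps only the <item> text found inside <bus>."""

    def __init__(self):
        super().__init__()
        self.in_bus = False
        self.in_item = False
        self.buffer = []
        self.destinations = []

    def startElement(self, name, attrs):
        if name == "bus":
            self.in_bus = True
        elif name == "item" and self.in_bus:
            self.in_item = True
            self.buffer = []

    def characters(self, content):
        # Text can arrive in several chunks, so accumulate it.
        if self.in_item:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "item" and self.in_item:
            self.destinations.append("".join(self.buffer).strip())
            self.in_item = False
        elif name == "bus":
            self.in_bus = False


handler = BusHandler()
xml.sax.parseString(XML.encode("utf-8"), handler)
print(handler.destinations)  # ['Bath', 'Devizes', 'Chippenham', 'Trowbridge']
```

Notice that the handler never holds the whole document - only a couple of flags and the items it has caught - which is why this approach copes with very large inputs.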

DOM - The Document Object Model

In DOM, data is parsed into a structure in memory. An XML document is a series of tags ... so each of those becomes an array or list (depending on the programming language that you're using), and within each of those you have other tags which in turn become further arrays or lists within the first. Attributes - not touched on in this article - become hashes, dictionaries or associative arrays, and the text data is stored as strings in the arrays. So the file is translated into something that can be held in memory and, with carefully written recursive code, manipulated very flexibly indeed. DOM is good for smaller data sets, and it's a great tool if you want to edit and save changes to your original XML. It's not going to work for you if you have an enormous XML file, because the whole document has to fit in memory.
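Here's a minimal sketch of the DOM approach, using Python's standard xml.dom.minidom module on a cut-down, well-formed copy of the bus data. The extra destination ("Swindon") is made up purely to show an edit, and the exact in-memory representation will differ from one language and library to another.

```python
from xml.dom import minidom

# A cut-down, well-formed copy of the sample document.
XML = """<place>Melksham Town Center
<facility><bus><item>Bath</item> <item>Devizes</item>
   <item>Chippenham</item> <item>Trowbridge</item></bus>
</facility></place>"""

# The whole document is parsed into a tree held in memory.
doc = minidom.parseString(XML)
bus = doc.getElementsByTagName("bus")[0]

# Edit the tree: add a new (made-up) destination as the last child of <bus>.
new_item = doc.createElement("item")
new_item.appendChild(doc.createTextNode("Swindon"))
bus.appendChild(new_item)

# Read the text back out of each <item> under <bus>.
destinations = [item.firstChild.data for item in bus.getElementsByTagName("item")]
print(destinations)

# Serialise the edited tree back to XML text, ready to be saved.
saved = doc.toxml()
print(saved)
```

Because the complete tree is available, you can change any part of it and write the result back out - the job that SAX is poor at.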

XSLT - Extensible Stylesheet Language Transformations

XSLT is a language which allows you to specify how your XML is to be transformed. You can write formatting information, tags, loops and all the other things you're used to in XSLT ... and the result of an XSLT transform is likely to be XHTML. Let's say you have 60 staff, with an XML file holding records for each of them, and you want to display the data in 3 different ways. Then you'll write 3 XSLT files to define the mappings, and the result will be that you can get any of your 60 x 3 (= 180) possible displays. An XSLT stylesheet happens to be a well-formed XML document itself ...

Cocoon

Apache Cocoon is a system that allows you to take XML and transform it into different formats for different purposes - taking my "staff record" example again, I could set up Cocoon to give me PostScript files for printing, XHTML for display, selective XML for public release via a news feed, PDF for producing a flyer about the employee, and so on.

And how does XHTML fit into this?

XHTML is HTML with the additional rules of XML enforced - so although you're still laying out how things are to be displayed rather than what they are, you're specifying that in a consistent form that's easy to edit with HTML editors and will cause fewer headaches as you view your page on different browsers - assuming you stick with standard tags! In practice, that means (for instance) that every <li> must be closed with a matching </li>.

Our example in XHTML:

<h1>Melksham Town Center</h1>
<ol><li>Woolworths, Boots, Peacocks and Iceland</li>
<li>All the major banks</li>
<li>Tourist Information Center and Post Office</li>
<li>Bus to Bath, Devizes, Chippenham and Trowbridge</li>
</ol>
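One way to see the difference is to feed both versions of the list to an XML parser. This sketch uses Python's standard xml.etree.ElementTree. The is_well_formed helper and the wrapping <div> (added only to give each fragment a single root element) are assumptions made for this illustration. It checks XML well-formedness only, not validity against the XHTML definition.

```python
import xml.etree.ElementTree as ET

HTML_VERSION = """<h1>Melksham Town Center</h1>
<ol><li>Woolworths, Boots, Peacocks and Iceland
<li>All the major banks
<li>Tourist Information Center and Post Office
<li>Bus to Bath, Devizes, Chippenham and Trowbridge
</ol>"""

XHTML_VERSION = """<h1>Melksham Town Center</h1>
<ol><li>Woolworths, Boots, Peacocks and Iceland</li>
<li>All the major banks</li>
<li>Tourist Information Center and Post Office</li>
<li>Bus to Bath, Devizes, Chippenham and Trowbridge</li>
</ol>"""


def is_well_formed(fragment):
    """Hypothetical helper: True if the fragment parses as XML."""
    try:
        # Wrap in a single root element, as an XML document needs exactly one.
        ET.fromstring("<div>" + fragment + "</div>")
        return True
    except ET.ParseError:
        return False


html_ok = is_well_formed(HTML_VERSION)
xhtml_ok = is_well_formed(XHTML_VERSION)
print(html_ok, xhtml_ok)  # False True
```

The HTML version is perfectly acceptable to a browser, but its unclosed <li> tags stop an XML parser; the XHTML version passes, so XML tools such as the SAX and DOM parsers described above can read it.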