Databases - when to treat the rules as guidelines
Archive - Originally posted on "The Horse's Mouth" - 2011-10-23 18:16:11 - Graham Ellis

One of the "rules" of database design is that data should only be stored once. Another is that calculated values should not be stored - they should be calculated every time you ask for them. That way, you'll get consistent results from your database enquiries. Of course, if the data you've stored is wrong, your results will be consistently wrong - but that makes the errors easier to find, and once you fix them you can be sure that a single fix will put everything derived from them right.
Except ...
At times, it's not so much a rule, more a set of guidelines (I remember that line from "Pirates of the Caribbean" - said in relation to the Pirate's Code). Let me give you a couple of examples.
1. Caching results. Let's look at some queries on a web site:
"What time is the next train to Swindon" [Looks it up] "19:47"
"What time is the next train to Swindon" [Looks it up] "19:47"
"What time is the next train to Swindon" [Looks it up] "19:47"
"What time is the next train to Swindon" [Looks it up] "19:47"
"What time is the next train to Swindon" [Looks it up] "19:47"
"What time is the next train to Swindon" [Looks it up] "19:47"
Hang on ... we're repeating ourselves here and may be using up an awful lot of resource in the process. We would do far better to store the result in some sort of temporary cache which gets cleared out when the source data changes. Many databases are "read mostly" after all.
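Here's a minimal sketch of that idea in Python; lookup_next_train() is a hypothetical stand-in for the real (expensive) database query, and the cache here is just an in-process dictionary:

```python
# A tiny result cache: answers are remembered until the source
# data changes, so repeated identical enquiries cost almost nothing.

_cache = {}

def lookup_next_train(destination):
    """Hypothetical stand-in for the real (expensive) database query."""
    return "19:47"

def next_train(destination):
    # Serve from the cache if we already hold an answer
    if destination not in _cache:
        _cache[destination] = lookup_next_train(destination)
    return _cache[destination]

def timetable_updated():
    # The source data has changed - throw the cached answers away
    # so the next enquiry is answered from the authoritative data
    _cache.clear()
```

Call next_train("Swindon") as often as you like; only the first call after timetable_updated() does any real work.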
2. Where results are multiple short fields from a record that contains one or more huge fields. For example, the images in our picture library are stored in a database table, with each image having fields such as a brief description, an image name, and an id (all of which are quite short) ... and each record also contains a longblob which is the image itself. That's a great structure for looking up individual images, but it makes searching very slow indeed (the whole 500MB database needs to be read, and that leaves the server disc-bound), so we have duplicated the small columns into a smaller table - less than half a Mbyte - so that we can search and work with the control data easily. And - again - we refresh this duplicated data when new images are added to the database (perhaps 2 or 3 times a day) or descriptions are changed; there's a sketch of the scheme below.
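To make that concrete, here's a minimal sketch of the two-table arrangement, using Python and SQLite in place of the live system; the table and column names are illustrative, and SQLite's BLOB stands in for the longblob:

```python
import sqlite3

# The authoritative table holds everything, including the huge blob;
# image_summary is a small duplicate of just the short columns.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE images (
        id          INTEGER PRIMARY KEY,
        name        TEXT,
        description TEXT,
        picture     BLOB        -- the large field that makes scans slow
    );
    CREATE TABLE image_summary (
        id          INTEGER PRIMARY KEY,
        name        TEXT,
        description TEXT
    );
""")

def refresh_summary():
    # Rebuild the small table from the authoritative one - run this
    # whenever images are added or descriptions change
    db.execute("DELETE FROM image_summary")
    db.execute("""INSERT INTO image_summary (id, name, description)
                  SELECT id, name, description FROM images""")
    db.commit()

def search(term):
    # Searches scan only the small table, never touching the blobs
    return db.execute(
        "SELECT id, name FROM image_summary WHERE description LIKE ?",
        ("%" + term + "%",)).fetchall()
```

The same effect could be had with triggers on the authoritative table; a periodic rebuild like this is simply the least intrusive way to bolt it onto an existing system.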
To give you an idea of just how much difference this can make ... I added the extra table to our picture database over the weekend, and the average job queue length on our server at any time has halved. It remains to be seen how much effect this will have when weekday traffic levels resume tomorrow morning!
In both cases, you'll note, we have designated an authoritative data source, and the data that is authoritative remains unique. In that way, we can ensure that any errors can still be fixed at a single point, although an extra process is now required upon data change - and that may be as simple as setting a flag to indicate that derived data needs to be regenerated (sketched below). Or - think of it another way - we're only writing our own caching or indexing system to provide more tuned caching / indexing than the underlying database would provide.
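As a final illustration - and this is just a sketch of the flag idea, with illustrative names, not our actual code - the derived copy can be regenerated lazily, on the first read after a change:

```python
# Writes to the authoritative data only set a flag; the derived
# copy is rebuilt on the next read rather than on every write.

class DerivedCopy:
    def __init__(self, regenerate):
        self._regenerate = regenerate   # rebuilds from the authoritative source
        self._stale = True
        self._value = None

    def invalidate(self):
        self._stale = True              # cheap enough to call on every write

    def get(self):
        if self._stale:                 # regenerate only when actually needed
            self._value = self._regenerate()
            self._stale = False
        return self._value

# Usage: wrap the expensive rebuild, invalidate on change, read freely
summary = DerivedCopy(lambda: "rebuilt summary data")
summary.invalidate()    # called whenever the source table changes
print(summary.get())    # triggers one rebuild, then serves the copy
```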