Storing your intermediate data - what format should you choose?

Archive - Originally posted on "The Horse's Mouth" - 2012-11-20 22:13:58 - Graham Ellis
Many applications need to hold data at intermediate stages, which means storing it. What format should be used? It's a HUGE subject.

1. If there is already an industry standard or draft standard way of doing it, think very carefully before going for anything else. The standard will have been designed with ease of use for the particular applications in mind, and the designers will already have considered the pitfalls. And if you use a standard format, you're also likely to be able to work with that data using the many utility programs that others have already written. If that doesn't work for you ...

2. Do you need to edit it in situ, with lots of people potentially making changes to it at the same time? If so, some sort of database - SQL or NoSQL - would be worth looking at, especially if the data is heavily structured.
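
As a rough illustration of the database route, here is a minimal Python sketch using the standard library's sqlite3 module. The table name "readings", its columns and the sample row are made up for the example. SQLite copes with many readers but serialises writers, so for lots of people editing at once a server database would be the more usual choice; the in-place update idea is the same.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a small SQLite store for intermediate records."""
    conn = sqlite3.connect(path)
    # Hypothetical table layout, just for illustration.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        " id INTEGER PRIMARY KEY, station TEXT NOT NULL, value REAL)"
    )
    return conn

def update_value(conn, record_id, value):
    """Change one record in place, inside a transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE readings SET value = ? WHERE id = ?", (value, record_id)
        )

conn = open_store()
with conn:
    conn.execute(
        "INSERT INTO readings (station, value) VALUES (?, ?)", ("Melksham", 4.5)
    )
update_value(conn, 1, 5.0)
print(conn.execute("SELECT station, value FROM readings").fetchall())
```

Passing a filename instead of ":memory:" would keep the data between runs.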

3. Is the data somewhat free format, in that you have various different fields in different records, without many records being complete? If this is the case and you want a readable, sharable file structure, then you might want to look at XML or JSON or some other key / value type format.
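
To show what "different fields in different records" might look like in practice, here is a small Python sketch using the standard json module. The records and field names are invented for the example. Each record carries only the keys it actually has, and the reader supplies a default for anything missing.

```python
import json

# Two hypothetical records with different, incomplete sets of fields.
records = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "phone": "01234 567890", "notes": "prefers phone"},
]

# Serialise to readable, sharable text ...
text = json.dumps(records, indent=2)

# ... and read it back, coping with fields that are not present.
loaded = json.loads(text)
for rec in loaded:
    print(rec["name"], rec.get("email", "(no email)"))
```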

Data often takes the form of "records" which each have a number of fields in them ... each record of a similar format, and with a limited need to get back in and interactively edit the file. And reading sequentially from end to end may be fine. In which case:

4. Plain text file. Unless you're certain about the maximum size of every field on the line, I would suggest that these days you go for a file in which each field is separated by a "cardinal character" - in other words, a character which is special and cannot occur in the content of any field. Commonly used cardinal characters are space, tab and comma. I've also come across colon and semicolon.
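
A minimal Python sketch of the separator idea, assuming tab as the cardinal character. The function names are made up for the example. The writer refuses any field that contains the separator, since that would break the "cannot occur in any field" rule.

```python
SEPARATOR = "\t"  # the "cardinal character" chosen for this sketch

def write_record(fields):
    """Join fields into one line, refusing any field that contains the separator."""
    for field in fields:
        if SEPARATOR in field or "\n" in field:
            raise ValueError(f"field {field!r} contains a separator or newline")
    return SEPARATOR.join(fields) + "\n"

def read_record(line):
    """Split one line back into its fields, keeping empty fields."""
    return line.rstrip("\n").split(SEPARATOR)

line = write_record(["Graham", "Well House", ""])
print(read_record(line))
```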

If there's ANY chance of the cardinal character appearing in any field, you need to adjust the format. A typical "CSV" (Comma Separated Values) file allows for commas within a field, but fields that contain commas as data must then be surrounded by quotes, which in turn means that if you want fields to contain quotes, you need to do something about them. One common way is to make backslash special - with \" meaning "I really want a "" and also \\ meaning "I really want a \". (Many CSV tools double the quote instead, writing "" for a literal ".) It's usually much easier to use a tab character as separator, especially if it's never going to be contained in the data. This does mean you have to be careful if manually editing the file ...
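
The quoting rules are fiddly enough that it's usually worth letting a library handle them. Here is a sketch using Python's standard csv module; the sample row and helper names are invented for the example. By default the module doubles any quote inside a quoted field, and it can be asked for the backslash style instead through its doublequote and escapechar settings.

```python
import csv
import io

# A hypothetical row with a comma in one field and quotes in another.
row = ["Ellis, Graham", 'He said "hello"', "plain"]

def to_csv_line(fields, **fmt):
    """Write one row to a string using the csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerow(fields)
    return buf.getvalue()

def from_csv_line(line, **fmt):
    """Read one row back from a string."""
    return next(csv.reader(io.StringIO(line), **fmt))

# Default style: quotes inside a field are doubled.
doubled = to_csv_line(row)

# Backslash style: quotes are escaped rather than doubled.
backslash_fmt = {"doublequote": False, "escapechar": "\\"}
escaped = to_csv_line(row, **backslash_fmt)

print(doubled, end="")
print(escaped, end="")
```

Whichever style you choose, the reader has to be given the same settings as the writer.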

Further suggestions / notes:

a) If you possibly can, write code that reads the file to ignore blank lines, and lines that start with # characters. In Perl - something like this:
next if ($lyne =~ /^\s*#/ or $lyne =~ /^\s*$/);
in your reading loop. That way, you can edit your data, space out groups of records, and add in comments to describe the format and make other points to anyone who comes along to read the data later on.
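
For comparison, a rough Python equivalent of that Perl line, written as a small generator. The name data_lines is made up for the example. It skips lines that are blank or whose first non-space character is #.

```python
def data_lines(lines):
    """Yield only the real data lines, skipping blanks and # comments."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line, or a comment line
        yield line.rstrip("\n")

sample = [
    "# name<TAB>town\n",
    "\n",
    "Graham\tMelksham\n",
    "   # an indented comment\n",
    "Lisa\tBath\n",
]
print(list(data_lines(sample)))
```

Because it accepts any iterable of lines, an open file handle can be passed in directly.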

b) If you're going to have lots of data files of the same type, if you're going to keep the files for a while, if you're going to pass the files on to others, it's a good idea to provide some sort of internal labelling about what the file is - don't rely on the name. This could be done simply by adding a comment line at the top of the file when you write it (see (a) just above) or you could add a separate header line or header block.
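
One possible shape for such a label, sketched in Python. The "# format:" prefix, the version number and the function names are all assumptions made for the example, not any established convention. The label is written as a comment line, so a reader that skips # lines still works, while a reader that cares can check it.

```python
HEADER_PREFIX = "# format:"  # hypothetical labelling convention

def make_header(format_name, version):
    """Build a comment line that says what the file is."""
    return f"{HEADER_PREFIX} {format_name} v{version}"

def parse_header(first_line):
    """Return (format_name, version), or None if the line is not a label."""
    if not first_line.startswith(HEADER_PREFIX):
        return None
    label = first_line[len(HEADER_PREFIX):].strip()
    name, _, version = label.rpartition(" v")
    return name, version

header = make_header("station-readings", 1)
print(header)
print(parse_header(header))
```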

c) Where data integrity and completeness are of cardinal importance, you might want to consider "start of data" and "end of data" records to avoid any future problems with truncated data files.
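
A small Python sketch of the start and end record idea. The marker text "BEGIN-DATA" / "END-DATA" and the function names are invented for the example, and putting a record count on the end marker is an extra assumption of mine rather than part of the suggestion. The markers deliberately don't start with #, so they won't be thrown away by a reader that skips comment lines.

```python
START = "BEGIN-DATA"  # hypothetical marker records
END = "END-DATA"

def wrap(records):
    """Surround the records with start and end markers (end carries a count)."""
    return [START, *records, f"{END} {len(records)}"]

def unwrap(lines):
    """Return the records, or raise ValueError if the file looks incomplete."""
    if not lines or lines[0] != START:
        raise ValueError("missing start-of-data record")
    last = lines[-1].split()
    if len(last) != 2 or last[0] != END or not last[1].isdigit():
        raise ValueError("missing end-of-data record; file may be truncated")
    body = lines[1:-1]
    if len(body) != int(last[1]):
        raise ValueError("record count does not match end-of-data record")
    return body

stored = wrap(["Graham\tMelksham", "Lisa\tBath"])
print(unwrap(stored))
```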

5. Options (1) through (4) won't deal with every scenario. There may be times that you'll store XML in a database, that you'll go for fixed length records, binary encoding and all sorts of other things. You might want to write directly to spreadsheet files or produce .pdf documents or even graphics which contain your data within barcodes or QR codes ... you may decide on a folder/directory with a series of individual files therein, or you may package up lots of elemental files into a .zip / .jar file. Like I said at the start, huge subject, no single solution.
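
As one example from that list, here is a Python sketch of packaging several small files into a single .zip using the standard zipfile module. The file names and contents are made up, and the archive is built in memory just to keep the example self-contained.

```python
import io
import zipfile

def pack(files):
    """Bundle a {name: text} mapping into one zip archive, returned as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)  # str content is stored as UTF-8
    return buf.getvalue()

def unpack(blob):
    """Read every member of the archive back into a {name: text} mapping."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}

# Hypothetical elemental files.
parts = {"stations.txt": "Melksham\nBath\n", "notes/readme.txt": "demo data\n"}
archive = pack(parts)
print(sorted(unpack(archive)))
```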

Very often on our programming courses, we'll look at a customer's individual data requirements and help guide that customer through the start of the process to work out their various formats - after all, this comes very much at the start of the UML design process - "who provides the data, what's done with it, what are the results for whom".