
Cottage industry or production line data handling methods

Archive - Originally posted on "The Horse's Mouth" - 2006-06-07 06:20:11 - Graham Ellis

If you're running a cottage industry, for efficiency's sake you'll run the first process on each of your raw components, storing the partially-completed elements in a basket as they're processed. When you've completed that first process, you'll apply the second process to each element in turn, then the third process, and so on. For small-scale production, that's a much MUCH better use of your time and resources than setting up all the processes in what's probably a very small area and trying to take each component through from end to end.

But if your level of throughput is to be several orders of magnitude greater, the cottage industry approach doesn't work, and you'll want to set up a production line in a bigger area. You'll have more overheads, as each stage of the line will need someone to operate it, but you'll save on the need to store large numbers of intermediate components, and you'll save the time spent putting down and picking up components at each stage too.

Data handling flows can resemble a cottage industry or a production line ... and how much data you have determines which approach will be the most effective.

In Shell Programming, running a series of commands, each of which reads from a file and saves its output to another file with > style redirects, is your cottage industry approach - for example, sort < raw.txt > sorted.txt followed by uniq < sorted.txt > final.txt. Joining the commands with a pipe - sort < raw.txt | uniq > final.txt - is more production line, with a buffer (usually of 4k) between the processes.
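
To make the contrast concrete, here is a minimal Python sketch of both shapes, driving the same two commands through the subprocess module (the commands and file names are illustrative):

    import subprocess

    # Cottage industry: run each stage to completion,
    # storing the intermediate result in a file.
    with open("sorted.txt", "w") as out:
        subprocess.call(["sort", "raw.txt"], stdout=out)
    with open("final.txt", "w") as out:
        subprocess.call(["uniq", "sorted.txt"], stdout=out)

    # Production line: connect the stages with a pipe, so both
    # processes run at once and no intermediate file is needed.
    sorter = subprocess.Popen(["sort", "raw.txt"], stdout=subprocess.PIPE)
    with open("final.txt", "w") as out:
        subprocess.call(["uniq"], stdin=sorter.stdout, stdout=out)
    sorter.stdout.close()
    sorter.wait()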

In Python, functions such as range and readlines return complete lists which you subsequently work through - cottage industry. Alternatives such as xrange and xreadlines are lazy, running in step with their calling code and handing over each value on demand - your production line. And you can write your own generator functions; you can spot them in existing code by the yield keyword.
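
As a minimal sketch of a home-made generator (the function name, file name and handling shown are illustrative):

    def read_records(filename):
        # A generator: each yield hands one line back to the
        # caller on demand, rather than building a complete list.
        infile = open(filename)
        for line in infile:
            yield line.rstrip("\n")
        infile.close()

    # The loop pulls records through one at a time - production line.
    for record in read_records("data.txt"):
        print(record.upper())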

As well as the cottage industry / production line comparison, I also liken the one-at-a-time approach to filling up a reservoir from one process, then draining the reservoir from another. If you have a huge amount of data, you're likely to overflow your reservoir and have your program fail. Running the processes at the same time, however, is rather like joining them with a pipe, with a tap that is turned on and off each time a new chunk of data is required. This is the approach I've successfully taught to clients with data files of up to 10 Gbytes in size, letting them handle their data in easily written Python scripts.
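
That turning of the tap, chunk by chunk, might be sketched like this (the chunk size and file name are illustrative):

    def chunks(filename, size=64 * 1024):
        # Open the tap: yield one fixed-size chunk at a time, so
        # memory use stays constant however large the file grows.
        infile = open(filename, "rb")
        while True:
            block = infile.read(size)
            if not block:
                break
            yield block
        infile.close()

    total = 0
    for block in chunks("huge_data_file.log"):
        total += len(block)    # process each chunk as it arrives
    print(total)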