
Loading and saving data - Python / numpy

Archive - Originally posted on "The Horse's Mouth" - 2010-10-09 07:38:42 - Graham Ellis

If you're using big data sets in Python, you're probably using the numpy module, which gives you data handling that runs at C speed while you still code at Python speed. But how do you load that data in? Numpy also provides a number of data handlers and data setup routines, as well as a save and restore capability.

There's a very basic example at [link] where I've generated a numpy object from text (I could have used a file ...) - each row and column in the incoming text string has been placed into a row or column in the numpy array.
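
As a minimal sketch of that idea (not the linked example itself), np.loadtxt can read whitespace-separated text, and wrapping the string in io.StringIO lets it stand in for a file. The sample values and the name grid are made up for illustration.

```python
import io
import numpy as np

# Made-up sample text: three rows, four whitespace-separated columns
text = """1 2 3 4
5 6 7 8
9 10 11 12"""

# io.StringIO makes the string behave like an open file
grid = np.loadtxt(io.StringIO(text))

print(grid.shape)   # (3, 4)
print(grid[1, 2])   # 7.0 (loadtxt produces floats by default)
```

Passing a filename instead of the StringIO object reads the same layout from disk.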

I've added a further example too ...

Our daily server log file comprises about 150,000 access records (so it's 30MB to 40MB in size) and I wanted to see how the traffic varies in each hour through the week via a graph. That meant going through a week of logs and finding a piece of information in around a million records, spread over around a quarter of a gigabyte of data, to produce the graph. Python's quite impressive even without numpy: that analysis took less than 10 seconds on my laptop. But later I'll be doing the same exercise to average out the data for a whole six months, and the time will start to get serious.
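
The counting step might look something like the sketch below. It is not the code from the post: it assumes Apache-style timestamps such as [09/Oct/2010:07:38:42 +0100], and the names STAMP, count_by_hour and the sample records are all hypothetical.

```python
import re
from datetime import datetime

# Assumption: timestamps look like [09/Oct/2010:07:38:42 +0100]
STAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})")

def count_by_hour(lines):
    """Return a 7 x 24 list of hit counts: counter[weekday][hour]."""
    counter = [[0] * 24 for _ in range(7)]
    for line in lines:
        found = STAMP.search(line)
        if not found:
            continue  # skip lines without a recognisable timestamp
        when = datetime.strptime(found.group(1), "%d/%b/%Y:%H:%M:%S")
        counter[when.weekday()][when.hour] += 1  # Monday is 0
    return counter

# Made-up sample records
sample = [
    '10.0.0.1 - - [09/Oct/2010:07:38:42 +0100] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [09/Oct/2010:07:59:01 +0100] "GET /a HTTP/1.1" 200 90',
    'not a log line',
]
counter = count_by_hour(sample)
print(counter[5][7])  # 2 (both hits fall on a Saturday between 07:00 and 07:59)
```

For real data, lines would be an open log file, which Python reads one line at a time rather than loading it whole.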

Numpy's save and load functions allowed me to dump out my array to a file, and to load it back in again. My 10 seconds drops to less than 1 second if I do this for a week of data (and for six months it would drop me from about four minutes down to 1 second!).

I did the counting in an ordinary Python list. The code to convert that list into a numpy array (another numpy extra feature) is:
  info = np.asarray(counter)
and the code to save the data to file is:
  np.save("logweek.npy",info)
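
Put together, the two calls make a round trip like the one sketched here. The 7 x 24 list is a stand-in for the real counts, and the temporary directory is only there so that the sketch leaves no files behind.

```python
import os
import tempfile
import numpy as np

counter = [[0] * 24 for _ in range(7)]   # stand-in for the real counts
counter[5][7] = 2

info = np.asarray(counter)               # nested list -> 7 x 24 integer array

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "logweek.npy")
    np.save(path, info)
    restored = np.load(path)

print(restored.shape)                    # (7, 24)
print(np.array_equal(info, restored))    # True
```

The .npy file records the shape and data type alongside the values, so the array comes back exactly as it was saved.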

When I came to run the program (again), I simply had it check if the file existed and if it did, I loaded it:
  if os.path.exists("logweek.npy"):
    info = np.load("logweek.npy")
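
A fuller version of that check, including the branch that does the counting when no saved file exists, could be sketched as follows. The helper load_or_build and the stand-in slow_count are hypothetical names, not taken from the complete source.

```python
import os
import tempfile
import numpy as np

def load_or_build(path, build):
    """Load the array saved at path, or call build() and save its result."""
    if os.path.exists(path):
        return np.load(path)
    info = np.asarray(build())
    np.save(path, info)
    return info

calls = []

def slow_count():
    # Stand-in for the expensive pass over the log files
    calls.append(1)
    return [[0] * 24 for _ in range(7)]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "logweek.npy")
    first = load_or_build(path, slow_count)    # counts, then saves
    second = load_or_build(path, slow_count)   # loads the saved file

print(len(calls))  # 1 (the counting ran only once)
```

Note that the saved file is not refreshed when the logs change; with this approach you would delete it to force a recount.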


The complete source code example is [here] ... note that it also uses matplotlib, a plotting library that's often used in association with numpy and scipy.


If you're looking to save pure Python data, have a look at the pickle and marshal modules that are a part of the standard distribution ... or, in Python 2, the cPickle module, which is implemented in C and much quicker; in Python 3 the plain pickle module uses that C implementation automatically. We have various examples around - [marshall example] and a [post on pickling].
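
For comparison with the numpy calls, here is a minimal pickle round trip. The stats dictionary is made-up example data, and this sketch is not one of the linked examples.

```python
import pickle

# Made-up example data: plain Python objects, no numpy involved
stats = {"site": "example", "hits": [3, 1, 4], "peak": ("Sat", 7)}

blob = pickle.dumps(stats)        # serialise to bytes
restored = pickle.loads(blob)     # rebuild an equal object

print(restored == stats)          # True
```

pickle.dump and pickle.load do the same job against a file opened in binary mode, which is the closer match to np.save and np.load.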