Python timing - when to use a list, and when to use a generator
Archive - Originally posted on "The Horse's Mouth" - 2012-06-16 10:10:53 - Graham Ellis
The same thing applies when you're programming. In a program with a small amount of data, it's quicker and easier to handle it all via a list, but as the data size grows, so does the amount of memory you need for a list, and it becomes quicker to handle the data as it becomes available - using a production-line type of approach known as an iterator or a generator.
Grand theory - how does that work in practice? In a language like Python, you can use a generator function. I've written about those before (examples HERE and HERE if you want to refer back). In Lua, you would use co-routines, and in any other language you could roll your own. But the question remains "Does it really make a difference?", and I thought I would try it out to see.
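As a quick reminder of the pattern (this is just a minimal sketch, not the code used in the tests below - the file name and the stripping of newlines are placeholders), a generator function uses yield to hand back one record at a time, where the list approach builds the whole thing in memory first:
  def record_list(filename):
      # List approach - every line is read and stored
      # before the caller can start work on any of them
      return [line.rstrip("\n") for line in open(filename)]

  def record_generator(filename):
      # Generator approach - each line is read and yielded
      # in turn, so only one line is held at a time
      for line in open(filename):
          yield line.rstrip("\n")
The calling loop looks the same either way - for record in record_list("acday") or for record in record_generator("acday") - it's only what's held in memory behind the scenes that differs.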
Let's look at the tools I'll use:
• Seven log files from our server, covering periods from a minute (43 accesses) to a year (nearly 30 million accesses):
munchkin:pyjun12 grahamellis$ wc -l ac[a-z]* | sort -bn
43 acminute
2671 achour
79015 acday
533818 acweek
992112 acfortnight
2481275 acmonth
29775300 acyear
• Two programs: dave, which reads all the data into a list before processing it, and jenny, which processes each record as it reads it, using a generator (a sketch of the general shape of each follows this list). Both programs produce identical output files when run on the same input; I've checked this using the Unix diff command.
• Python's resource module, which is part of the standard library on Unix / Linux systems, but needs to be imported into your program. See the Python documentation for full details.
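To give a feel for what the two test programs do, here's a hedged sketch of their general shape (it isn't the exact code of dave or jenny - the per-host counting is just an illustrative piece of processing):
  import sys

  def reader(filename):
      # jenny-style: a generator that yields each access log
      # record as it's read, so only one line is in memory
      for line in open(filename):
          yield line.split()

  def slurp(filename):
      # dave-style: read and store every record in a list
      # before any processing starts
      return [line.split() for line in open(filename)]

  # Illustrative processing - count accesses per client host
  # (the first field of each record here); swap reader for
  # slurp to compare the two approaches
  counts = {}
  for fields in reader(sys.argv[1]):
      host = fields[0]
      counts[host] = counts.get(host, 0) + 1

  for host in sorted(counts):
      print host, counts[host]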
The code I have added at the end of both dave and jenny, to report on resources, is as follows:
  import resource

  # Resource usage of the current process (RUSAGE_SELF)
  used = resource.getrusage(resource.RUSAGE_SELF)

  print "Maximum resident size:", used.ru_maxrss          # peak resident set size
  print "Run time:", used.ru_utime + used.ru_stime        # user + system CPU time, in seconds
  print "Page faults:", used.ru_minflt + used.ru_majflt   # minor + major page faults
So - how does that run?
Maximum resident memory while running (bytes):
Duration  |       list | generator
Minute    |    4116480 |   4112384
Hour      |    4837376 |   4108288
Day       |   30355456 |   4108288
Week      |  181055488 |   4108288
Fortnight |  333402112 |   4112384
Month     |  837042176 |   4108288
Year      | 3351420928 |   4096000
Run time in seconds (sum of system and user CPU time):
Duration  |       list |  generator
Minute    |   0.055345 |   0.056894
Hour      |   0.064015 |   0.062101
Day       |   0.355556 |   0.296621
Week      |   2.088976 |   1.718086
Fortnight |   3.833132 |   3.145695
Month     |   9.516904 |   7.700457
Year      | 353.116987 | 103.592527
Page faults (sum of minor and major):
Duration  |    list | generator
Minute    |    1479 |      1479
Hour      |    1653 |      1476
Day       |    8044 |      1476
Week      |   45638 |      1476
Fortnight |   83796 |      1478
Month     |  209181 |      1476
Year      | 9143068 |      1491
With a generator, the amount of memory used stays at around the 4 Mbyte level no matter how much data is thrown at the program, but reading all the data into a list before processing any of it starts at around the same 4 Mbytes and rises to over 3 Gbytes - not really a surprise, since the data file is well over 8 Gbytes in size for the year. You'll notice too that the run time was roughly the same when the data set was small, but that the list version became significantly slower than the generator equivalent as the data set grew. In fact, the best advice is to always use a generator if the data set may get large: when the data set is small, a generator is almost as fast and the process is quick anyway; but when the data set is large, using a list will be much slower - and that's a slowing down of what's already a long-running process.
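In many programs the change needed is tiny. As a rough illustration (a sketch, not anything taken from dave or jenny), counting the accesses in one of the log files above can be done either way - replacing the list comprehension with a plain generator expression is enough to stop the whole file being held in memory at once:
  # List: every line is read and stored before len() can run
  lines = [line for line in open("acmonth")]
  print "Accesses:", len(lines)

  # Generator expression: lines are counted as they are read,
  # one at a time, so memory use stays flat
  print "Accesses:", sum(1 for line in open("acmonth"))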
As a final thought ... look at the "page fault" table. Time-shared operating systems such as Linux manage a program's memory in "pages", loading and storing pages as needed to allow for effective concurrency. The page fault figure that I've quoted tells me how many times my program has said "oops - I'll have to go off and find that page", and when I'm storing the whole list that's a very large number. With a generator, the figure's virtually unchanging - once the program's in memory it's ticking over sweetly, and it can carry on even if we throw a decade's worth of data at it. I would expect severe problems if I threw a decade's worth of data at my list example ...
One of the very first Python courses I delivered was to a company handling a huge data set, and by changing their lists to generators in existing code we made a huge difference for them. I had delegates from the same organisation on last week's Python course, and they still have very large data sets indeed. Hence this follow-up post, to help them put figures on - at least - a benchmark piece of code and data.