Main Content

Some tips and techniques for huge data handling in Python

Archive - Originally posted on "The Horse's Mouth" - 2013-05-15 01:04:13 - Graham Ellis

Python's an excellent tool for handling huge data sets and long-running programs, although some of the elements of the language that you'll use for such work aren't exactly things we teach on our Introduction to Python courses. Yesterday, however, I was teaching an Intermediate Python course, and had a chance to cover a number of these things.

Some elements of note:

a) Progress logging to stderr:

  tracey = sys.stderr

and

  percent = 100.0 * counter / totalwork
  report = "Here we go ... {:8.2f}% of the way\r".format(percent)
  tracey.write(report)
  tracey.flush()


in my code. The output's to stderr rather than stdout so that it won't be redirected to file if there's any redirection done with >. It's output using \r rather than \n to ensure that reports overwrite one another, and I've added a flush so that the output doesn't hang around in buffers but is displayed straight away, even though there are no newlines (\n)s.

b) Reprogramming of ^C

  signal.signal(signal.SIGINT,sighandler)

which causes ^C to run a handler:

  def sighandler(which, frame):
  
    # This could be run at ANY point ... don't do much in here!
  
    global interim
    interim = 1
  
    # If ^C is pressed twice within a second, really do kill it!
  
    now = time.time()
    sighandler.recent += 1
    if now - sighandler.recent < 1:
      sys.exit(0)
    sighandler.recent = now


I've tried to do as little as possible in this handler, as the code could be called a just about any time. It tries to do little more than set a flag to indicate that an interim report is to be produced at an appropriate point. However, I have added extra code to pick up ^C twice in a second - if someone's hammering the keyboard then, sure, let the program exit.

Complete source code [here].

If you want ^C to generate an exception, see [here]. That's not a suitable trap where we want to resume execution straight away, as exceptions jump out from a piece of trapping code