Some tips and techniques for huge data handling in Python
Archive - Originally posted on "The Horse's Mouth" - 2013-05-15 01:04:13 - Graham Ellis
Python's an excellent tool for handling huge data sets and long-running programs, although some of the elements of the language that you'll use for such work aren't exactly things we teach on our Introduction to Python courses. Yesterday, however, I was teaching an Intermediate Python course, and had a chance to cover a number of these things.
Some elements of note:
a) Progress logging to stderr:
tracey = sys.stderr
and
percent = 100.0 * counter / totalwork
report = "Here we go ... {:8.2f}% of the way\r".format(percent)
tracey.write(report)
tracey.flush()
in my code. The output goes to stderr rather than stdout so that it isn't sent to a file when stdout is redirected with >. Each report ends with \r rather than \n so that successive reports overwrite one another on the same line, and I've added a flush so that the output doesn't hang around in buffers but is displayed straight away, even though no newline (\n) is written.
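As a minimal sketch, the two fragments above can be wrapped into one helper. The function name report_progress, its stream parameter and the loop sizes are my own assumptions for illustration; they are not taken from the original program.

```python
import sys


def report_progress(counter, totalwork, stream=None):
    """Write a one-line progress report that overwrites the previous one."""
    if stream is None:
        stream = sys.stderr          # stderr is not affected by "> file"
    percent = 100.0 * counter / totalwork
    # \r returns the cursor to the start of the line without moving down
    stream.write("Here we go ... {:8.2f}% of the way\r".format(percent))
    stream.flush()                   # show it now; no newline will trigger a flush


def demo(totalwork=2000):
    """Hypothetical work loop that reports every 100 items."""
    for counter in range(1, totalwork + 1):
        if counter % 100 == 0:
            report_progress(counter, totalwork)
    sys.stderr.write("\n")           # finish on a fresh line
```

Reporting only every so many iterations, as demo does, keeps the cost of writing and flushing small compared with the real work.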
b) Reprogramming of ^C
signal.signal(signal.SIGINT, sighandler)
which causes ^C to run a handler:
def sighandler(which, frame):
    # This could be run at ANY point ... don't do much in here!
    global interim
    interim = 1
    # If ^C is pressed twice within a second, really do kill it!
    now = time.time()
    if now - sighandler.recent < 1:
        sys.exit(0)
    sighandler.recent = now
# Time of the previous ^C; must be set before the first signal arrives
sighandler.recent = 0
I've tried to do as little as possible in this handler, as the code could be called at just about any time. It does little more than set a flag to indicate that an interim report is to be produced at an appropriate point. However, I have added extra code to pick up ^C twice in a second - if someone's hammering the keyboard then, sure, let the program exit.
Complete source code [here].
If you want ^C to generate an exception, see [here]. That approach isn't suitable where we want to resume execution straight away, because an exception jumps out of the code that was running and into the trapping code, rather than returning to the point of interruption.
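To illustrate that last point: with Python's default handling, ^C raises KeyboardInterrupt in the main thread. The sketch below is my own and is not the linked example; it raises the exception by hand so that the effect can be seen without a keyboard. The function name and its parameters are hypothetical.

```python
def count_until_interrupted(values, interrupt_at):
    """Count items processed before a (simulated) ^C arrives."""
    done = 0
    try:
        for value in values:
            if value == interrupt_at:
                raise KeyboardInterrupt   # stands in for the user pressing ^C
            done += 1
    except KeyboardInterrupt:
        # We arrive here with the loop already abandoned; there is no way
        # to step back in and continue with the next value.
        pass
    return done
```

The remaining values are never processed, which is why the flag-setting handler is the better fit for an interim report.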