Practical Extraction and Reporting - using Python and Extreme Programming
Archive - Originally posted on "The Horse's Mouth" - 2011-10-14 10:07:38 - Graham Ellis
Forums give people an opportunity to express their views, to add their comments to other people's, and to post their information. As such, they also give people a wonderful opportunity to get off-topic messages onto publicly readable forums on the Internet. My mailbox contains adverts for pharmaceutical products, get-rich-quick schemes, books on Steve Jobs (this week), overseas graduate programs, crocuses, home security systems, dating services, airline tickets and more ... and given half a chance, the same people who pester me with unsolicited email would love to advertise on the forum and pester people there too. To keep the wood visible amongst the trees, we limit signups on "The Coffeeshop" to people who have a genuine interest, and who will post about the issues for which the forum exists. We still get plenty of signup requests, but our vetting process is such that very few of the "spammers", or rather wannabe spammers, actually manage to get as far as posting.
Vetting is wasteful of our time, though, and we're always looking to improve our tools to help us spot the spammers quickly. Recently, I added extra logging of signup requests to help us look at them in a "pageview" mode, and we have now reached the reporting requirement: looking at the data that's building up, to keep us even better informed for the future.
So ... the specification for the program, and of the requirement, looks a bit woolly. I decided to apply some of the techniques of "Extreme Programming" to the task: writing a short story saying what we wanted - "We would like to be able to count up how many spammers come from where so that we can tell which places are the worst / most likely" - and then tackling it through a spike solution, in which I wrote experimental code to see how an answer would look. I selected Python for the task (an excellent language for the job, and the language I've been teaching this week) ... and off I headed.
As I start coding, the story turns out to be to convert data such as:
1 LV Haus finanzieren andrahartwick@gmail.com 91.224.246.15 Thu, 13 Oct 2011 06:26:34 +0100
1 CN cabinet519 zhaominyu15@163.com 113.231.181.142 Thu, 13 Oct 2011 06:26:44 +0100 Shenyang
into results like:
RU 41 Russian Federation
CN 38 China
DE 34 Germany
US 17 United States
UA 16 Ukraine
PL 9 Poland
LV 8 Latvia
etc
and then to expand that if necessary (in fact a separate "story") by zone:
CN 38 China
Beijing 18
[unknown] 4
Guangzhou 4
Putian 3
Shenyang 2
Shanghai 2
Jinan 2
Nanjing 1
Wuhan 1
Qingdao 1
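To give a feel for the counting step, here is a minimal sketch rather than the code linked from this article. It assumes each log line has the layout seen in the two sample lines above (a flag, a two-letter country code, a username that may contain spaces, an email address, an IP address, a date, and an optional city), and the names SIGNUP, count_signups and report are made up for the illustration. It leaves out the country name lookup.

```python
import re
from collections import defaultdict

# Assumed layout, inferred only from the two sample lines in the article:
# flag, country code, username, email, IP address, date, optional city.
SIGNUP = re.compile(
    r"^\d+\s+(?P<country>[A-Z]{2})\s+.*?\s+\S+@\S+\s+"
    r"\d{1,3}(?:\.\d{1,3}){3}\s+"
    r"\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4}"
    r"\s*(?P<city>.*)$"
)

def count_signups(lines):
    """Count signup requests per country, and per city within each country."""
    by_country = defaultdict(int)
    by_city = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = SIGNUP.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a signup record
        country = match.group("country")
        city = match.group("city").strip() or "[unknown]"
        by_country[country] += 1
        by_city[country][city] += 1
    return by_country, by_city

def report(by_country, by_city, verbose=False):
    """Return report lines, busiest country first; add cities if verbose."""
    busiest_first = lambda pair: (-pair[1], pair[0])
    out = []
    for country, total in sorted(by_country.items(), key=busiest_first):
        out.append("{} {}".format(country, total))
        if verbose:
            for city, n in sorted(by_city[country].items(), key=busiest_first):
                out.append("  {} {}".format(city, n))
    return out

sample = [
    "1 LV Haus finanzieren andrahartwick@gmail.com 91.224.246.15 Thu, 13 Oct 2011 06:26:34 +0100",
    "1 CN cabinet519 zhaominyu15@163.com 113.231.181.142 Thu, 13 Oct 2011 06:26:44 +0100 Shenyang",
]
by_country, by_city = count_signups(sample)
for line in report(by_country, by_city, verbose=True):
    print(line)
```

Keeping the two dictionaries separate means the short report never touches the city data, which matches the way the by-city breakdown was treated as its own story.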
Now that I have got to that point in my exploration of the data, if I needed more I would be refactoring - taking what I have learned and recoding it to make it maintainable. You can see the code [here], with some quite notable comments pointing out its shortcomings, ready for the refactoring exercise if that ever comes (and if you want to run the program yourself, there's a data sample [here]).
I'm sharing this example on our web site under our "Data Munging in Python" heading - for even in its raw form it's a good example of some of the techniques commonly used ... in the source, you'll find coding samples of:
• Regular Expressions (to match patterns in data and extract from them)
• Command Line handling (we've used a -v option to select the verbose / by city report)
• Dictionaries (to keep count by countries as we read the data file)
• The urllib2 module (to read a web page from a remote server - the ISO country code lookup!)
• Checking whether a file exists (via os.path.exists)
• routing non-data output to stderr (via sys.stderr)
• lambda (to provide single line functions)
• read (to slurp an entire file into a variable)
• title (to take a country name that's SHOUTED AT YOU and reduce it to more manageable speech!)
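Several of the smaller techniques in that list fit into a few lines each. The sketch below is an illustration written for current Python, not an extract from the linked source, and the helper names load_text, parse_args and tidy are invented for it. The urllib2 step is left out because it needs a network connection; note that urllib2 is the Python 2 module, and its role is taken by urllib.request in Python 3.

```python
import os.path
import sys

def load_text(path):
    """Slurp a whole file into one string, or return None if it is missing."""
    if not os.path.exists(path):
        # Diagnostics go to stderr so they never mix with the report on stdout
        sys.stderr.write("No such file: %s\n" % path)
        return None
    with open(path) as handle:
        return handle.read()

def parse_args(argv):
    """Split a list such as sys.argv[1:] into (verbose flag, other arguments)."""
    verbose = "-v" in argv
    rest = [arg for arg in argv if arg != "-v"]
    return verbose, rest

# lambda gives a single line function; title() tames a SHOUTED country name
tidy = lambda name: name.strip().title()

verbose, files = parse_args(["-v", "signups.txt"])  # hypothetical file name
print(verbose, files)
print(tidy("RUSSIAN FEDERATION"))
```

For anything beyond a single flag, the standard library's argparse module would be the more usual choice; the hand-rolled check here just keeps the example short.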
Truly, so much of the power of any language comes not from the power of individual features, but rather from the power of using them in combination, and from researching, refactoring and reusing the code that uses those features.