Unique word locator - Python dict example

Archive - Originally posted on "The Horse's Mouth" - 2016-03-06 08:02:23 - Graham Ellis

If a word occurs only once in my blog - all 4661 entries so far - chances are that it's a misspelling. Using a dict in Python, I can quickly parse a data stream with lots of text in it, isolate individual words, and see how many times each occurs.

Dictionaries are a very quick and easy way of looking up keys (they're used internally for variable names in most scripting languages), so this runs really fast.

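  import re

  # A "word" is any run of two or more letters; re.I makes the match case-insensitive
  word = re.compile(r'[A-Z]{2,}', re.I)
  wordcount = {}
  with open("blog") as source:
    for line in source:
      for item in word.findall(line):
        i2 = item.lower()
        # get() returns 0 the first time a word is seen
        wordcount[i2] = wordcount.get(i2, 0) + 1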
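
For comparison, the same tally could probably be written with collections.Counter from the standard library, which takes care of the "start at zero" step that get(i2,0) handles above. This is only a sketch of an alternative, not what the script uses; the function name count_words and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Same pattern as the script: runs of two or more letters, case-insensitive
WORD = re.compile(r'[A-Z]{2,}', re.I)

def count_words(lines):
    """Tally lower-cased words across an iterable of text lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # update() adds one to the count of each word it is given
        counts.update(w.lower() for w in WORD.findall(line))
    return counts

# Made-up sample text standing in for the blog data
sample = ["Well House Manor", "well mouse manor"]
counts = count_words(sample)
```

A Counter is a dict subclass, so counts["manor"] works just like wordcount["manor"], and looking up a word that never appeared gives 0 rather than an error.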


I can then sort and output my answers. You can't sort a dict, but you can sort a list of its keys:

  # Python 3: sort the keys by count, most frequent first
  used = sorted(wordcount, key=wordcount.get, reverse=True)
  for item in used:
    print(item, wordcount[item])
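
Since the aim is to find words that occur only once, one more filter on the counts could pull those out directly. A minimal sketch, assuming a small stand-in dict (with a few counts borrowed from the output further down) in place of the real tally; the name singletons is just for illustration:

```python
# Stand-in for the wordcount dict built from the blog (sample values only)
wordcount = {
    "manor": 1094,
    "wellhousemanor": 547,
    "wellmousemanor": 1,
    "manorside": 1,
}

# Keep only the words seen exactly once, in alphabetical order
singletons = sorted(word for word, count in wordcount.items() if count == 1)

for word in singletons:
    print(word)
```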


Sometimes I mistype our domain name "wellhousemanor" ... let's see:

  WomanWithCat:f2916 grahamellis$ python uniwords | grep manor
  manor 1094
  wellhousemanor 547
  showmanor 5
  wellhousmanor 4
  theoldmanor 2
  manorsnow 2
  manordawn 2
  manorgarden 1
  wellhouesemanor 1
  wellhhousemanor 1
  manorside 1
  manorembossed 1
  manorcard 1
  manorgant 1
  greatchalfieldmanor 1
  manordaffs 1
  wellmousemanor 1
  WomanWithCat:f2916 grahamellis$



Well Mouse Manor ;-)
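
Rather than picking the near misses out of the grep output by eye, something like difflib.get_close_matches from the standard library might find them automatically. This is a sketch only: the candidate list is a handful of words copied from the output above, and the 0.8 cutoff is an arbitrary threshold chosen for the example, not a tuned value.

```python
import difflib

target = "wellhousemanor"

# A few words from the grep output, used as sample candidates
candidates = [
    "wellhousmanor", "wellhouesemanor", "wellhhousemanor",
    "wellmousemanor", "manorside", "manorgarden",
]

# cutoff=0.8 is an assumed similarity threshold; n caps the number of results
near_misses = difflib.get_close_matches(target, candidates, n=10, cutoff=0.8)

for word in near_misses:
    print(word)
```

With this sample, the four mistyped variants come back and the unrelated "manorside" and "manorgarden" do not.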

Complete source - [here]