Finding all the unique lines in a file, using Python or Perl

Archive - Originally posted on "The Horse's Mouth" - 2012-03-20 19:41:23 - Graham Ellis

A question: how do I process all the unique lines from a file in Python? It was asked by a delegate today, and solved neatly and easily using a generator, which means there's no need to build the complete list of results before processing starts: each unique value is passed back and processed onwards as it's found. (The lines already seen still have to be remembered, so that repeats can be spotted.) This is fantastic news if the input isn't really a file, but some other, slower, reporting data source, and you would like to get answers even while the data is still flowing in.

  def unique(source):
    sofar = set()                # lines already seen
    with open(source) as fh:     # closed when the generator finishes
      for val in fh:
        if val not in sofar:
          sofar.add(val)
          yield val.strip()
  
  for lyne in unique("info.txt"):
    print(lyne)
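
The same idea carries over to a source that isn't a file at all. What follows is only a minimal sketch under a few assumptions: `unique_items` and `slow_feed` are hypothetical names, the feed is a hard-coded stand-in for a live reporting source, and lines are stripped before they are compared (the code above compares the raw lines), so a final line with no newline still matches an earlier copy.

```python
from typing import Iterable, Iterator


def unique_items(source: Iterable[str]) -> Iterator[str]:
    """Yield each distinct stripped line the first time it is seen."""
    seen = set()
    for raw in source:
        line = raw.strip()
        if line not in seen:
            seen.add(line)
            yield line


def slow_feed() -> Iterator[str]:
    """Hypothetical stand-in for a live reporting source."""
    for raw in ["alpha\n", "beta\n", "alpha\n", "gamma\n", "beta\n"]:
        yield raw


if __name__ == "__main__":
    # Each value is printed as soon as it is first seen,
    # without waiting for the feed to finish.
    for item in unique_items(slow_feed()):
        print(item)
```

Because the function only asks for "something you can loop over", an open file, a list or another generator can all be passed in unchanged.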


Neat, isn't it? I love Python! And to test that love, I thought I would answer the same question in Perl:

  sub unique {
    open FH,$_[0] or die "Cannot open $_[0]: $!";
    my %sofar;
    my @uvals;
    while (my $line = <FH>) {
      if (! $sofar{$line}) {
        $sofar{$line} = 1;
        push @uvals,$line;
      }
    }
    close FH;
    return @uvals;
  }
  
  foreach my $lyne (unique("info.txt")) {
    print $lyne;
  }


A little longer, and as Perl doesn't have a generator as such, I was tempted to leave the code returning the unique list only once the whole incoming data flow had been received. But a little more thought let me produce a generator-like alternative:

  sub unique {
    # $static and %sofar are globals, so they persist between calls
    $static or open FH,$_[0];
    $static = 1;
    while (my $line = <FH>) {
      if (! $sofar{$line}) {
        $sofar{$line} = 1;
        return $line;
      }
    }
    return "";   # end of data
  }
  
  while ($lyne = unique("info.txt")) {
    print $lyne;
  }


Actually rather neat, but it relies on global variables to hold the state of the "generator" routine, and you need to take care to flag the end of the data. Careful code examination will show you that the return ""; is actually redundant in practice: when the loop exits, the sub falls off its end and what comes back is the false result of the failed loop test. That's not something to depend on, though, as Perl's own documentation describes the return value as unspecified when the last statement of a sub is a loop. Start applying tricks like this and you're getting into code that's going to be hard to maintain.
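
For contrast, here is a small sketch of how the Python generator avoids both of those concerns. The name `unique_from` and the sample lists are made up for illustration. Each call creates a generator with its own private `sofar`, so two can run side by side, and the end of the data is signalled for you rather than by a special return value.

```python
def unique_from(lines):
    """Yield each value in lines the first time it appears."""
    sofar = set()          # local to this one generator, not global
    for val in lines:
        if val not in sofar:
            sofar.add(val)
            yield val


first = unique_from(["a", "b", "a"])
second = unique_from(["b", "b", "c"])

print(next(first))         # a
print(next(second))        # b  (second has not seen what first has seen)
print(next(first))         # b
print(list(second))        # ['c']
print(next(first, None))   # None: first is exhausted, no sentinel needed
```

A `for` loop handles the end-of-data signal automatically; `next(gen, None)` is only used here to make it visible.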

Truth be known, I love Perl too. See our Perl Courses and Python Courses. I'm happy to teach you either, and to help you use their strengths and write good, maintainable code in both.