[Sigia-l] Common word compilation app suggestion

Andrew McNaughton andrew at scoop.co.nz
Thu Jun 13 15:01:10 EDT 2002


On Thu, 13 Jun 2002, Sean Lawrence wrote:

> Does anyone know of a utility or application that can go through a document
> and find commonly used words?  I'd imagine there is a way to do it with grep
> but I'm not quite talented enough to create the proper string to do that.

This Perl script might be all you need:

----- wordcount.pl -------
#!/usr/bin/perl
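use strict;
use warnings;

# count words read from stdin, or from any files named on the command line
my %count;
while (<>) {
  # a word is a run of letters that may contain an apostrophe,
  # but may not start or end with one
  my @words = m/([a-z][a-z']*[a-z]|[a-z])/ig;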
  foreach my $word (@words) {
    $count{lc $word}++;
  }
}

# list frequencies to stdout
foreach my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
  print "$word: $count{$word}\n";
}
--------------------------

This script defines a word as any sequence of Latin letters, which may be
broken by an apostrophe (e.g. "they're") but by no other character (e.g.
"pseudo-science" is treated as two words).  Words are lowercased before
counting, so "The" and "the" are counted together.
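
To see how that definition plays out on a real line, here is a small
sketch that applies the same pattern to one sample sentence.  The
words_in helper and the sample sentence are made up for illustration
and are not part of wordcount.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the lowercased words found in one line,
# using the same pattern as wordcount.pl.
sub words_in {
    my ($line) = @_;
    return map { lc($_) } ($line =~ m/([a-z][a-z']*[a-z]|[a-z])/ig);
}

# The hyphen splits "pseudo-science" in two; the apostrophes stay put.
print join(' | ', words_in("They're testing pseudo-science, aren't they?")), "\n";
# prints: they're | testing | pseudo | science | aren't | they
```

Note that an apostrophe at the very start or end of a word (as in
"dogs'") is dropped, since the pattern requires a letter at both ends.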

Andrew McNaughton



