htdig: wordfreq


Iosif Fettich (ifettich@netsoft.ro)
Tue, 17 Nov 1998 19:07:12 +0200 (EET)


> Now I'll have to try using contrib/wordfreq/ or Geoff's method. I assume
> halving the database size would not only save disk space, but speed
> searches.
>
> >used "cut -f 1 db.wordlist | uniq -c | sort -r" to determine how many
> >documents each word was in, then I took the top 500 and edited the list.
> ^^^^^^^^
> Edited db.wordlist, I assume?
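
A note on the quoted pipeline: uniq -c only merges adjacent duplicate lines, and sort -r orders the counts as text, so a more defensive variant sorts the words first and then orders the counts numerically. The following is only a sketch. doc_freq is a made-up helper name, and it assumes a text wordlist whose first tab-separated field is the word, as the quoted command implies.

```shell
# doc_freq: hypothetical helper, not part of ht://Dig.
# Reads a tab-separated text wordlist on stdin (assumption: field 1 is the
# word) and prints "count<TAB>word", most frequent first.
doc_freq() {
    cut -f 1 |
        sort |
        uniq -c |
        awk '{ print $1 "\t" $2 }' |
        sort -k1,1nr
}

# Usage (path is a placeholder):
#   doc_freq < /yourpath/db.wordlist | head -n 500
```

The count is the number of wordlist lines each word appears on, which matches the "documents per word" reading only if the wordlist holds one line per word and document.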

I've attached a slightly updated version of the contributed wordfreq.pl
program that can read the new Berkeley DB databases. Run it like this,
for example,
wfdb.pl /yourpath/db.wordlist.db | sort -n

to get the list of word frequencies; then cut the numbers, keep the words,
and add them to your badwords list.
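
The cut-and-keep step could be done with something like the following. This is a minimal sketch, not anything shipped with ht://Dig: top_words is a hypothetical helper name, 500 is just the figure mentioned earlier in the thread, and it assumes the "count<TAB>word" lines that the script prints.

```shell
# top_words: hypothetical helper, not part of ht://Dig.
# Reads "count<TAB>word" lines on stdin and prints the N most frequent
# words, one per line, with the counts stripped (N defaults to 500).
top_words() {
    n="${1:-500}"
    sort -t "$(printf '\t')" -k1,1nr |
        head -n "$n" |
        cut -f 2
}

# Usage (paths are placeholders):
#   wfdb.pl /yourpath/db.wordlist.db | top_words 500 > candidates.txt
```

It is probably worth reviewing the resulting list by hand before adding it to the badwords list, since frequent words are not always useless ones.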

Iosif Fettich

#!/usr/local/bin/perl

use GDBM_File;
use BerkeleyDB;

##
## wordfreq.pl
## (C) 1995 Andrew Scherpbier <andrew@sdsu.edu>
##
## Will generate a list of words and how frequently they are used
##
## updated to deal with Berkeley db files 1998 Iosif Fettich <ifettich@netsoft.ro>
##


$filetype = 'DB';

# Show usage unless a filename is given; a second argument, if present,
# must contain "g".
if (not defined $ARGV[0] or (defined $ARGV[1] and $ARGV[1] !~ /g/i)) {
   print "\n\nThis program is used in conjunction with HtDig \n";
   print "to determine the frequency of words in a database containing word references.\n\n";
   print "Usage: $0 filename (to use a Berkeley db2 wordlist)\n";
   print " $0 filename g[dbm] (to use a GDBM wordlist)\n\n\n";
   exit;
}

$filename = $ARGV[0];

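# Treat the file as GDBM if its name ends in "gdbm" or the second
# argument contains "g".
if ($filename =~ /gdbm$/i or (defined $ARGV[1] and $ARGV[1] =~ /g/i)) {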
  $filetype = 'GDBM';
}

if ($filetype eq 'GDBM') {
   tie %worddb, 'GDBM_File', $filename, GDBM_READER, 0
       or die "Unable to open $filename: $!\n";
} else {
   tie %worddb, 'BerkeleyDB::Btree',
              -Filename => $filename,
              -Flags => DB_RDONLY
       or die "Cannot open file $filename: $! $BerkeleyDB::Error\n" ;
}

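# Each value is a packed sequence of 20-byte records; each record unpacks
# as five native ints (this assumes 4-byte ints).
while (($key, $value) = each %worddb)
{
    $length = int(length($value) / 20);   # number of records for this word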
    $total = 0;
    foreach $i (0 .. $length - 1)
    {
        ($count, $id, $weight, $anchor, $location) =
            unpack("i i i i i", substr($value, $i * 20, 20));
        $total += $count;
    }
    print "$total\t$key\n";
}




This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:28:49 PST