Re: [htdig] alterations

Subject: Re: [htdig] alterations
From: David Adams
Date: Fri Nov 26 1999 - 01:42:15 PST

> According to David Adams:
> > I have downloaded the script, and the xpdf and catdoc
> > utilities, and I am now using them to extend our search index to include
> > Word and PDF files. It all works well and with a bit of alteration to
> > the Perl script does exactly what I want. My thanks to the developers!
> I forgot to ask before, what were your alterations? Something very
> specific to your needs, or something worth sharing with others?
> --
> Gilles R. Detillieux E-mail: <>
> Spinal Cord Research Centre WWW:
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

Well, since you ask, I noticed two problems with PDF files on our site:

1. the titles were often meaningless, having no connection with
        the contents.

2. pdftotext outputs some spurious non-ASCII gibberish that is
        then indexed.

I modified the code which outputs the title to always include the
type, and to put any extracted title in double quotes or the filename
in square brackets:

# if no title use filename from URL
if (not length($title)) {
        $title = $ARGV[2];
        $title =~ s#^.*/##;
        $title = '[' . $title . ']';
} else {
        $title = '"' . $title . '"';
}
print "t\t$title ($type Document)\n";
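
For anyone who wants to try the title logic outside the parser, here is a
stand-alone sketch of the same idea. The name format_title is hypothetical
and not part of the script; its three arguments stand in for $title,
$ARGV[2] (the URL) and $type, and the example.com URL is made up.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# $title: title extracted from the document (may be empty)
# $url:   document URL (the script takes this from $ARGV[2])
# $type:  document type label, e.g. "PDF"
sub format_title {
    my ($title, $url, $type) = @_;
    if (not length($title)) {
        # no title: fall back to the filename part of the URL, in brackets
        ($title = $url) =~ s#^.*/##;
        $title = '[' . $title . ']';
    } else {
        # extracted title: wrap it in double quotes
        $title = '"' . $title . '"';
    }
    return "t\t$title ($type Document)\n";
}

print format_title('', 'http://www.example.com/docs/report.pdf', 'PDF');
print format_title('Annual Report', 'http://www.example.com/docs/report.pdf', 'PDF');
```

The first call shows the bracketed filename fallback, the second the quoted
title; both end with the type in parentheses.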

To throw away the spurious "words" I simplified the code to replace
all non-alphanumerics with spaces. I appreciate that many people would
think that too drastic:

while (<CAT>) {
        while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
                $_ .= <CAT> || last; # append the next line; stop at end of input
        }
        $head .= " " . $_;
# s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\$
# s/[\255]/-/g; # replace dashes with $
        s/\W/ /g; # replace non-alphanumeric characters with spaces
        s/\s+/ /g; # replace multiple spaces, etc. with a single space
        @fields = split; # split up line
        next if (@fields == 0); # skip if no fields (do$
        for ($x=0; $x<@fields; $x++) { # check each field if s$
                if (length($fields[$x]) >= $minimum_word_length) {
                        push @allwords, $fields[$x]; # add to list
                }
        }
}
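
To see what the simplified substitution does to a single line, the word
extraction can be pulled out into a small function. This is only a sketch:
extract_words is a hypothetical name, the minimum length of 3 is an assumed
value for $minimum_word_length, and the sample line is invented.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# Returns the words of $line that are at least $minimum_word_length long.
sub extract_words {
    my ($line, $minimum_word_length) = @_;
    $line =~ s/\W/ /g;     # replace non-alphanumeric characters with spaces
    $line =~ s/\s+/ /g;    # replace runs of whitespace with a single space
    my @fields = split ' ', $line;
    return grep { length($_) >= $minimum_word_length } @fields;
}

# Punctuation and control characters all become word separators.
my @words = extract_words("Search (ht://Dig) ~~\x01\x02 index, v3.1!", 3);
print join(' ', @words), "\n";
```

Note that \W treats the underscore as a word character, and on plain byte
strings without "use locale" it also matches the accented characters in the
\300-\377 range, which is part of why the approach may be too drastic for
some sites.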

The spurious output is no longer indexed, but it still remains in the head
($head), so there is further room for improvement.
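
One possible way to close that gap would be to clean each line before it is
appended to $head. The sketch below is an assumption on my part, not
something in the script: clean_for_head is a hypothetical name, and keeping
only printable ASCII plus the \300-\377 range is a guess at what counts as
spurious, which will depend on what pdftotext emits for a given site.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# Keeps printable ASCII (\040-\176) and Latin-1 letters (\300-\377);
# every other byte becomes a space. The range is an assumption.
sub clean_for_head {
    my ($line) = @_;
    $line =~ s/[^\040-\176\300-\377]/ /g;  # blank out everything else
    $line =~ s/\s+/ /g;                    # collapse runs of whitespace
    return $line;
}

my $head = '';
for my $line ("Annual report,\x02\x9f 1999.\n", "Section 1: Overview\n") {
    $head .= " " . clean_for_head($line);
}
print "$head\n";
```

Unlike the \W substitution used for the word list, this leaves ordinary
punctuation in place, so the head stays readable.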

David J Adams
Computing Services
University of Southampton
