Re: [htdig] parse_doc.pl alterations


Subject: Re: [htdig] parse_doc.pl alterations
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Fri Nov 26 1999 - 01:42:15 PST


> According to David Adams:
> > I have downloaded the parse_doc.pl script, and the xpdf and catdoc
> > utilities, and I am now using them to extend our search index to include
> > Word and PDF files. It all works well and with a bit of alteration to
> > the Perl script does exactly what I want. My thanks to the developers!
>
> I forgot to ask before, what were your alterations? Something very
> specific to your needs, or something worth sharing with others?
>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

Well, since you ask, I noticed two problems with PDF files on our site:

1. the titles were often meaningless, having no connection with
        the contents.

2. pdftotext outputs some spurious non-ASCII gibberish that is
        then indexed.

I modified the code which outputs the title to always include the
type, and to put any extracted title in double quotes or the filename
in square brackets:

# if no title use filename from URL
if (not length($title)) {
        $title = $ARGV[2];
        $title =~ s#^.*/##;
        $title = '[' . $title . ']';
} else {
        $title = '"' . $title . '"';
}
print "t\t$title ($type Document)\n";
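
For illustration, here is the same logic wrapped in a small subroutine
with two sample calls. The name format_title and the example.org URL are
made up for this sketch; parse_doc.pl itself does this inline, using
$title, $type and $ARGV[2].

```perl
use strict;
use warnings;

# Hypothetical helper, not part of parse_doc.pl: builds the title string
# from an extracted title (possibly empty), the document URL and the type.
sub format_title {
    my ($title, $url, $type) = @_;
    if (not length($title)) {
        # no title: fall back to the filename part of the URL
        ($title = $url) =~ s#^.*/##;
        $title = '[' . $title . ']';
    } else {
        # extracted title: mark it with double quotes
        $title = '"' . $title . '"';
    }
    return "$title ($type Document)";
}

print format_title('', 'http://www.example.org/docs/report.pdf', 'PDF'), "\n";
# prints: [report.pdf] (PDF Document)
print format_title('Annual Report', 'http://www.example.org/docs/report.pdf', 'PDF'), "\n";
# prints: "Annual Report" (PDF Document)
```

The brackets and quotes make it obvious in the search results whether a
title came from the document or was derived from the filename.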

To throw away the spurious "words" I simplified the code to replace
all non-alphanumerics with spaces. I appreciate that many people would
think that too drastic:

while (<CAT>) {
        while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
                my $next = <CAT>;
                last unless defined $next; # end of input; Perl has no "break"
                $_ .= $next;
                s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/;
        }
        $head .= " " . $_;
# s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\$
# s/[\255]/-/g; # replace dashes with $
        s/\W/ /g; # replace non-alphanumeric characters with spaces
        s/\s+/ /g; # replace multiple spaces, etc. with a single space
        @fields = split; # split up line
        next if (@fields == 0); # skip if no fields (do$
        for ($x=0; $x<@fields; $x++) { # check each field if s$
                if (length($fields[$x]) >= $minimum_word_length) {
                        push @allwords, $fields[$x]; # add to list
                }
        }
}
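
To show what the two substitutions do to a single line, here is a small
stand-alone sketch. The sub name extract_words, the sample line and the
minimum length of 3 are invented for the example. Note that \W leaves
digits and underscores in place, and that without "use locale" it also
treats the accented characters \300-\377 in a plain byte string as
non-word characters, so accented words get split.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; parse_doc.pl does this inline.
sub extract_words {
    my ($line, $minimum_word_length) = @_;
    $line =~ s/\W/ /g;     # replace non-word characters with spaces
    $line =~ s/\s+/ /g;    # collapse runs of whitespace to one space
    # split into fields and keep only those that are long enough
    return grep { length($_) >= $minimum_word_length } split ' ', $line;
}

my @words = extract_words("The #\$%& quick (brown) fox, a.k.a. x\x01\x02y!\n", 3);
print join(' ', @words), "\n";
# prints: The quick brown fox
```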

The spurious output is no longer indexed, but it does remain in the head,
so there is further room for improvement.
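
One possible direction, offered only as an untested sketch: filter each
line before it is appended to $head, dropping any whitespace-separated
token that contains bytes outside printable ASCII and the \300-\377
range used elsewhere in the script. The sub name clean_for_head and the
rule for what counts as gibberish are assumptions, and may well be too
strict or too lenient for real pdftotext output.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only tokens made up entirely of printable
# ASCII characters or the Latin-1 letters \300-\377.
sub clean_for_head {
    my ($line) = @_;
    my @keep = grep { /^[\x21-\x7e\300-\377]+$/ } split ' ', $line;
    return join(' ', @keep);
}

print clean_for_head("Annual \x01\x9f\x02 report for 1999.\n"), "\n";
# prints: Annual report for 1999.
```

In the loop above this would mean changing the $head line to
$head .= " " . clean_for_head($_); so that punctuation survives in the
excerpt while the binary junk does not.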

-- 
 
David J Adams
<D.J.Adams@soton.ac.uk>
Computing Services
University of Southampton
