Re: [htdig] How do I access parse_doc.pl.gz?


Subject: Re: [htdig] How do I access parse_doc.pl.gz?
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Dec 13 1999 - 09:58:30 PST


According to Wayne Larmon:
> I'm trying to index pdf files. I'm using htdig 3.1.4 on Mandrake 6.1.
>
> I first tried Acroread. Acroread 4.0 fails with a "segmentation fault"
> problem. Acroread 3.0 indexes, but the text in the search results is binary
> gibberish.
>
> I then decided to try xpdf. I got the xpdf binaries downloaded, but now I'm
> stuck on accessing parse_doc.pl from your
> http://www.htdig.org/files/contrib/parsers/ directory because it is stored
> as parse_doc.pl.gz.
>
> "gunzip parse_doc.pl.gz" gives this error:
>
> gunzip: parse_doc.pl.gz: not in gzip format
>
> So how do I access parse_doc.pl.gz?

As was pointed out, the uncompressed version is also in the contrib
subdirectory of the htdig source tree you untarred. With 3.1.4, you
can do better than parse_doc.pl, though. There is the contrib/conv_doc.pl
script, which makes use of the new external converter feature. See the
comments in the script for an example of how you use it.
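
On the gunzip error quoted above: "not in gzip format" means the file does
not start with the gzip magic bytes (0x1f 0x8b). One possible cause, which is
only a guess and not confirmed in this thread, is that the download was
already decompressed on the way and just kept its .gz name. Here is a
minimal Python sketch for checking that. The helper names is_gzip and
unpack_or_copy are made up for illustration and are not part of htdig.

```python
import gzip
import shutil

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def unpack_or_copy(src, dest):
    """Write the uncompressed content of src to dest.

    If src really is gzip data it is decompressed; otherwise it is assumed
    (hypothetically) to be plain text already and is copied unchanged.
    """
    if is_gzip(src):
        with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        return "gunzipped"
    shutil.copyfile(src, dest)
    return "copied"
```

For example, unpack_or_copy("parse_doc.pl.gz", "parse_doc.pl") would leave a
usable parse_doc.pl either way, if the guess about the download is right.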

The benefit? More consistent parsing. parse_doc.pl dealt with punctuation
a little differently than the internal parsers did, which led to some
inconsistencies (strange non-words winding up in the database). By using
an external converter, you're feeding HTML or plain text back into one of
the internal parsers, so your documents get parsed the way text/plain or
text/html documents are.
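
To make the converter idea concrete, here is a rough Python sketch of what
such a script could look like. It is not conv_doc.pl, just an illustration
under some assumptions: that htdig passes the temporary file name and the
MIME type as the first two arguments, that whatever the script writes to
stdout is handed to the internal text/plain parser, and that xpdf's
pdftotext is on the PATH. The names CONVERTERS, build_command and main are
invented here. See the comments in conv_doc.pl and the external_parsers
documentation for the exact config syntax and calling convention.

```python
import subprocess
import sys

# Assumed mapping from MIME type to a command that writes plain text to
# stdout. "{input}" is a placeholder for the file htdig hands us.
# "pdftotext FILE -" asks xpdf's pdftotext to write to stdout.
CONVERTERS = {
    "application/pdf": ["pdftotext", "{input}", "-"],
}


def build_command(mime_type, input_path):
    """Return the argv list for this MIME type, or None if unsupported."""
    template = CONVERTERS.get(mime_type)
    if template is None:
        return None
    return [input_path if part == "{input}" else part for part in template]


def main(argv):
    """Convert argv[1] (a file path) according to argv[2] (its MIME type)."""
    if len(argv) < 3:
        print("usage: converter FILE MIME-TYPE [URL [CONFIG]]", file=sys.stderr)
        return 2
    input_path, mime_type = argv[1], argv[2]
    command = build_command(mime_type, input_path)
    if command is None:
        print("no converter for " + mime_type, file=sys.stderr)
        return 1
    # The child process inherits our stdout, so its text goes straight
    # back to the caller for parsing.
    return subprocess.run(command).returncode

# To use this as a standalone script, end the file with:
#     sys.exit(main(sys.argv))
```

The script does no word parsing of its own. It only turns the PDF into
plain text and leaves punctuation handling to the internal parser, which is
where the consistency comes from.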

The FAQ should probably be updated now to reflect this.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



