RE: [htdig] How do I access parse_doc.pl.gz?


Subject: RE: [htdig] How do I access parse_doc.pl.gz?
From: Wayne Larmon (wayne@scrounge.org)
Date: Mon Dec 13 1999 - 11:24:32 PST


> -----Original Message-----
> From: Gilles Detillieux [mailto:grdetil@scrc.umanitoba.ca]
> Sent: Monday, December 13, 1999 12:59 PM
> To: wayne@scrounge.org
> Cc: htdig@htdig.org
> Subject: Re: [htdig] How do I access parse_doc.pl.gz?
>
>
> According to Wayne Larmon:
> > I'm trying to index pdf files. I'm using htdig 3.1.4 on Mandrake 6.1.
> >
> > I first tried Acroread. Acroread 4.0 fails with a "segmentation fault"
> > problem. Acroread 3.0 indexes, but the text in the search results is
> > binary gibberish.
> >
> > I then decided to try xpdf. I got the xpdf binaries downloaded, but now
> > I'm stuck on accessing parse_doc.pl from your
> > http://www.htdig.org/files/contrib/parsers/ directory because it is
> > stored as parse_doc.pl.gz.
> >
> > "gunzip parse_doc.pl.gz" gives this error:
> >
> > gunzip: parse_doc.pl.gz: not in gzip format
> >
> > So how do I access parse_doc.pl.gz?
>
> As was pointed out, the uncompressed version is also in the contrib
> subdirectory of the htdig source tree you untarred.

I got them unzipped by using WS_FTP to download instead of trying to
download with Internet Explorer 5.
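
For anyone who hits the same "not in gzip format" error: one way to see what
a downloaded .gz file really contains is to look at its first two bytes,
since every gzip stream starts with 0x1f 0x8b. The sketch below is only an
illustration in Python; the helper names are made up for this example, and
it assumes (without confirmation from this thread) that the failing file was
already plain text carrying a .gz name.

```python
import gzip

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def read_maybe_gzipped(data: bytes) -> bytes:
    """Decompress gzip data; pass anything else through unchanged.

    Hypothetical helper: it covers the case where a download tool has
    already decompressed the file but kept the .gz extension.
    """
    if looks_like_gzip(data):
        return gzip.decompress(data)
    return data
```

If looks_like_gzip() returns False for the downloaded file, it may already
be the plain script and can simply be renamed to parse_doc.pl.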

> With 3.1.4, you
> can do better than parse_doc.pl, though. There is the contrib/conv_doc.pl
> script, which makes use of the new external converter feature. See the
> comments in the script for an example of how you use it.
>
> The benefit? More consistent parsing. parse_doc.pl dealt with punctuation
> a little differently than the internal parsers did, which led to some
> inconsistencies (strange non-words winding up in the database). By using
> an external converter, you're feeding HTML or plain text back into one of
> the internal parsers, so your documents get parsed the way text/plain or
> text/html documents are.

I just tried conv_doc.pl with htdig 3.1.4, configured as the comments in the
conv_doc.pl script indicate. I'm using it with the Xpdf 0.90 X86 Linux 2.0
binaries (http://www.foolabs.com/xpdf/download.html). I also downloaded and
installed catdoc (http://www.fe.msk.ru/~vitus/catdoc/) for converting
Microsoft Word documents. Htdig indexed the sample .pdf and .doc files I
tried, and htsearch retrieves them.
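
The external converter idea described above comes down to choosing a
text-extraction command by MIME type and handing its output back to the
internal parser. The following is a rough Python sketch of that dispatch
step only. It is not the logic of conv_doc.pl itself. It assumes pdftotext
(from Xpdf) and catdoc are on the PATH, and the table and function names are
hypothetical.

```python
import subprocess

# Hypothetical dispatch table: MIME type -> function building the command.
# "pdftotext FILE -" writes the extracted text to standard output;
# catdoc writes to standard output by default.
CONVERTERS = {
    "application/pdf": lambda path: ["pdftotext", path, "-"],
    "application/msword": lambda path: ["catdoc", path],
}


def build_command(mime_type: str, path: str) -> list:
    """Return the argument list for converting one document to text."""
    try:
        return CONVERTERS[mime_type](path)
    except KeyError:
        raise ValueError("no converter configured for " + mime_type)


def convert_to_text(mime_type: str, path: str) -> str:
    """Run the converter and return its output as text.

    Assumes the converter binaries are installed; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        build_command(mime_type, path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping the command construction separate from the subprocess call makes
the mapping easy to check without having either converter installed.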

> The FAQ should probably be updated now to reflect this.

It really is quite easy to configure this, so you should make sure that it
is covered in the FAQ, along with a strong recommendation to use Xpdf
instead of fooling with Adobe Acroread.

Thanks. This is another example of how htdig support is much better than
the support for any commercial program that I've used.

Wayne Larmon
