Re: [htdig] Help indexing .pdf files, please.

Subject: Re: [htdig] Help indexing .pdf files, please.
From: Gilles Detillieux
Date: Mon Feb 07 2000 - 13:05:50 PST

According to Stan Brown:
> Now I have a set of files from one vendor which consist of some (I
> think) fairly complex PDF documents. They for example have a lot of
> multi-page documents, and columns.
> I have tried using Acrobat 4, the script, and the
> script, all with their own sets of problems. With Acrobat 4
> I get what appears to be good extraction for some files, and nothing
> whatsoever for others. I have observed that one of the ones I am not
> getting anything on is a multi-page document.

Multi-page PDFs are not a problem. I've indexed many of them. Acrobat 4,
however, is a very poor choice for indexing PDFs. It's very buggy, and
crashes at the drop of a hat. Acrobat 3 seems to be more reliable.

> With parse_doc I get errors like:
> External parser error in line: without disrupting the other modules
> in the system
> With conv_doc, I get errors about some close failures (?).
> Can anyone give me some advice on how to make this work?

will only work with htdig 3.1.4 or later. Any errors you
get from it likely originate from the pdftotext program. If pdftotext
is giving you error messages, you quite likely have a defective PDF on
your hands.
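
One way to narrow down which PDFs are at fault is to run pdftotext over
each file by hand, outside of htdig, and see which ones it complains
about. The following is only a rough sketch in Python, not part of the
htdig distribution. The function name find_failing_pdfs and its command
parameter are made up for illustration, and it assumes pdftotext is on
your PATH and exits with a non-zero status when it cannot process a file.

```python
import subprocess
import sys


def find_failing_pdfs(paths, command=("pdftotext",)):
    """Return {path: error text} for PDFs the converter could not process.

    `command` is a hypothetical hook so a different converter (or a
    stand-in for testing) can be swapped in; it defaults to pdftotext.
    """
    failures = {}
    for path in paths:
        # "-" as the output name asks pdftotext to write the text to stdout.
        # The text itself is discarded; only the exit status and any
        # error messages on stderr are of interest here.
        result = subprocess.run(
            [*command, path, "-"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
        )
        if result.returncode != 0:
            message = result.stderr.strip()
            failures[path] = message or f"exit status {result.returncode}"
    return failures


if __name__ == "__main__":
    # Usage: python check_pdfs.py file1.pdf file2.pdf ...
    for path, message in find_failing_pdfs(sys.argv[1:]).items():
        print(f"{path}: {message}")
```

Any file this reports is worth opening in a viewer or regenerating from
the original source before blaming the indexing scripts. Note that a file
can also convert without error and still yield no text, which this sketch
does not check for.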

