Re: [htdig] Help indexing .pdf files, please.


Subject: Re: [htdig] Help indexing .pdf files, please.
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Feb 07 2000 - 13:05:50 PST


According to Stan Brown:
> Now I have a set of files from one vendor which consist of some (I
> think) fairly complex pdf documents. They for example have a lot of
> multi-page documents, and columns.
>
> I have tried using acrobat 4, the conv_doc.pl script, and the
> parse_doc.pl script, all with their own sets of problems. With acrobat4
> I get what appears to be good extraction for some files, and nothing
> whatsoever for others. I have observed that one of the ones I am not
> getting anything on is a multipage document.

Multi-page PDFs are not a problem. I've indexed many of them. Acrobat 4,
however, is a very poor choice for indexing PDFs. It's very buggy, and
crashes at the drop of a hat. Acrobat 3 seems to be more reliable.

> With parse_doc I get errors like:
>
> External parser error in line: without disrupting the other modules
> in the system
>
> with conv_doc, I get errors about some close failures (?).
>
> Can anyone give me some advice on how to make this work?

conv_doc.pl will only work with htdig 3.1.4 or later. Any errors you
get from it likely originate from the pdftotext program. If pdftotext
is giving you error messages, you quite likely have a defective PDF on
your hands.
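
One way to narrow this down is to run pdftotext by hand on each file and
see which ones produce error output. The following is only a rough sketch
in Python, not part of htdig or conv_doc.pl: the function name
find_failing_pdfs and its converter parameter are made up for
illustration, and it assumes pdftotext is on your PATH and accepts "-" as
the output file to write the text to stdout.

```python
import subprocess
import sys


def find_failing_pdfs(paths, converter=("pdftotext",)):
    """Run the converter on each file and return {path: error text} for failures.

    `converter` is a hypothetical parameter (not an htdig setting); it lets
    you substitute another command with the same "INPUT OUTPUT" argument order.
    """
    failures = {}
    for path in paths:
        # "-" as the output file asks pdftotext to write the text to stdout.
        result = subprocess.run(
            [*converter, path, "-"],
            capture_output=True,
            text=True,
            errors="replace",
        )
        # Treat a non-zero exit status or anything on stderr as a sign of trouble.
        message = result.stderr.strip()
        if result.returncode != 0 or message:
            failures[path] = message or f"exit status {result.returncode}"
    return failures


if __name__ == "__main__":
    # Usage (the script name is arbitrary): python check_pdfs.py file1.pdf file2.pdf
    for path, message in find_failing_pdfs(sys.argv[1:]).items():
        print(f"{path}: {message}")
```

Any file this reports is worth opening in a viewer or regenerating from
the original before blaming the indexing scripts.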

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



