Gilles Detillieux (firstname.lastname@example.org)
Thu, 8 Jul 1999 12:05:32 -0500 (CDT)
According to Joakim Wiberg (HMS):
> I try to index a PDF file and I get the following error.
> Read 8192 from document
> Read 1696 from document
> Read a total of 100000 bytes
> Can't determine type of file /usr/local/htdig/db/htdext.16478; content-type:
> application/pdf; URL: http://10.10.12.67/comp/datasheet/K00117.pdf
> I can get htdig to index common html pages, but when I try to index PDF
> files this problem arises.
I can see a couple of problems here. First of all, unless your K00117.pdf
is exactly 100000 bytes in length, it's being truncated: the "Read a total
of 100000 bytes" line matches the default max_doc_size. You'll likely
need to boost your max_doc_size attribute to something larger than your
biggest PDF to avoid truncation.
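As a rough way to pick that value, here is a minimal Python sketch. It assumes you have local filesystem access to the directory the web server serves the PDFs from. The function names and the 1.5x headroom factor are my own illustration, not anything htdig provides. It just finds the largest .pdf under a directory and prints a line you could paste into htdig.conf.

```python
import os
import sys


def largest_pdf_size(root):
    """Return the size in bytes of the largest .pdf file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest


def suggested_config_line(largest, headroom=1.5):
    """Format an htdig.conf attribute line, leaving headroom over the largest file.

    The headroom factor is an arbitrary illustrative choice.
    """
    return "max_doc_size: %d" % int(largest * headroom)


if __name__ == "__main__":
    # Usage: python pdfsize.py /path/to/document/root
    print(suggested_config_line(largest_pdf_size(sys.argv[1])))
```

Anything comfortably above your biggest PDF will do. The only cost of a larger value is that htdig may read more of each document into memory.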
Secondly, the error message above seems to come from the parse_doc.pl
script. For some reason, your PDF does not have a magic number that the
script recognises, so it's rejecting it. Try running pdftotext on it
directly, to see if pdftotext can handle it. If that works, there's a
discrepancy between what pdftotext and parse_doc.pl recognise as a valid
PDF, and I'd probably need a sample of such a PDF to fix the problem
in parse_doc.pl. If pdftotext can't handle K00117.pdf directly, you're
not going to be able to index it in any case -- not with this external
parser anyway. In that case, you'll need to find out what's wrong with
the file itself. If acroread can handle the PDF, and pdftotext can't, I
guess it's Derek Noonburg's problem (he's the author of xpdf, which
includes pdftotext). Is the PDF encrypted or encoded in some way or other?
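If you want a quick look at the file before sending a sample, something like the following Python sketch may help. It is not what parse_doc.pl does internally. It is only a standalone diagnostic of my own, assuming the usual PDF conventions: a "%PDF-" header at or near the start, a "%%EOF" marker near the end, and an /Encrypt entry in encrypted files. The function name and the 100000-byte limit parameter are illustrative.

```python
import sys


def inspect_pdf(data, fetch_limit=100000):
    """Run a few simple checks on the raw bytes of a (possibly truncated) PDF."""
    return {
        # A well-formed PDF begins with the "%PDF-" magic string.
        "starts_with_magic": data.startswith(b"%PDF-"),
        # Some files carry a few junk bytes first, so also search the first KB.
        "magic_in_first_kb": b"%PDF-" in data[:1024],
        # A complete PDF has a "%%EOF" marker near its end; a truncated one won't.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # Encrypted PDFs reference an /Encrypt dictionary.
        "mentions_encrypt": b"/Encrypt" in data,
        # True if the data is at least as long as the fetch limit, which
        # suggests the indexer may have cut it off.
        "hit_fetch_limit": len(data) >= fetch_limit,
    }


if __name__ == "__main__":
    # Usage: python pdfcheck.py K00117.pdf
    with open(sys.argv[1], "rb") as f:
        for check, result in sorted(inspect_pdf(f.read()).items()):
            print("%-20s %s" % (check, result))
```

Run it against both the original K00117.pdf and the temporary htdext file, if you can catch one. If the magic string only turns up after some leading junk, or the "%%EOF" marker is missing from the fetched copy, that would point to where the mismatch is.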
P.S. I'll be away tomorrow, so I probably won't get to look into this
further until Monday.
--
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930