Re: [htdig] Problem indexing PDF files


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 8 Jul 1999 12:05:32 -0500 (CDT)


According to Joakim Wiberg (HMS):
> I try to index a PDF file and I get the following error.
>
> Read 8192 from document
> Read 1696 from document
> Read a total of 100000 bytes
> Can't determine type of file /usr/local/htdig/db/htdext.16478; content-type:
> application/pdf; URL: http://10.10.12.67/comp/datasheet/K00117.pdf
>
> I can get htdig to index common html pages, but when I try to index PDF
> files this problem arises.

I can see a couple of problems here. First of all, unless your K00117.pdf
is exactly 100000 bytes in length, it's being truncated. You'll likely
need to boost your max_doc_size attribute to something larger than your
biggest PDF to avoid truncation.
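
If you're not sure how big your biggest PDF is, something like the
following rough Python sketch could find out. It assumes the PDFs live
under a directory you can read on the local disk; the function names and
the 25% headroom are my own choices for illustration, not anything that
ht://Dig provides.

```python
import os


def largest_pdf_size(root):
    """Return the size in bytes of the largest .pdf file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest


def suggested_max_doc_size(root, headroom=1.25):
    """Largest PDF size plus a margin; the 25% default margin is arbitrary."""
    return int(largest_pdf_size(root) * headroom) + 1
```

The number this returns would then go into your configuration file as
the value of the max_doc_size attribute, in place of the 100000 that
your output above shows.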

Secondly, the error message above seems to come from the parse_doc.pl
script. For some reason, your PDF does not have a magic number that the
script recognises, so it's rejecting it. Try running pdftotext on it
directly, to see if pdftotext can handle it. If that works, there's a
discrepancy between what pdftotext and parse_doc.pl recognise as a valid
PDF, and I'd probably need a sample of such a PDF to fix the problem
in parse_doc.pl. If pdftotext can't handle K00117.pdf directly, you're
not going to be able to index it in any case -- not with this external
parser anyway. In that case, you'll need to find out what's wrong with the
PDF itself. If acroread can handle the PDF and pdftotext can't, I guess
it's a problem for Derek Noonburg, the author of pdftotext. Is the PDF
encrypted or encoded in some way or other?
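
To narrow things down before running pdftotext, a quick look at the file
itself may help. The sketch below is in Python and is NOT the check that
parse_doc.pl performs; it only relies on general facts about the PDF
format: a well-formed file starts with "%PDF-" and ends with a "%%EOF"
marker, and an encrypted one has an /Encrypt entry in its trailer. The
function name, the 1024-byte probe size and the /Encrypt test are my own
assumptions, and the /Encrypt test is only a rough hint.

```python
def inspect_pdf(path, probe=1024):
    """Report a few basic facts about a file that is supposed to be a PDF."""
    with open(path, "rb") as f:
        head = f.read(probe)            # first bytes of the file
        f.seek(0, 2)                    # jump to the end to get the size
        size = f.tell()
        f.seek(max(0, size - probe))    # last bytes of the file
        tail = f.read()
    return {
        "size": size,
        # The PDF header is expected at the very start of the file.
        "has_header": head.startswith(b"%PDF-"),
        # Some files carry junk before the header; note that separately.
        "header_within_probe": b"%PDF-" in head,
        # A truncated file will normally have lost its end-of-file marker.
        "has_eof_marker": b"%%EOF" in tail,
        # Rough hint only: the trailer usually sits near the end of the file.
        "mentions_encrypt": b"/Encrypt" in tail,
    }
```

Run it on the original K00117.pdf, and if you can, on the temporary copy
that htdig hands to the external parser. If the original has an EOF
marker and the copy doesn't, truncation is the likely culprit. If the
header isn't at the very start of the file, that might explain why a
magic number check rejects it.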

Gilles

P.S. I'll be away tomorrow, so I probably won't get to look into this
further until Monday.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



This archive was generated by hypermail 2.0b3 on Thu Jul 08 1999 - 09:21:56 PDT