Re: [htdig] Problem indexing PDF files

Gilles Detillieux
Thu, 8 Jul 1999 12:05:32 -0500 (CDT)

According to Joakim Wiberg (HMS):
> I try to index a PDF file and I get the following error.
> Read 8192 from document
> Read 1696 from document
> Read a total of 100000 bytes
> Can't determine type of file /usr/local/htdig/db/htdext.16478; content-type:
> application/pdf; URL:
> I can get htdig to index common html pages, but when I try to index PDF
> files this problem arises.

I can see a couple of problems here. First of all, unless your K00117.pdf
is exactly 100000 bytes in length, it's being truncated. You'll likely
need to boost your max_doc_size attribute to something larger than your
biggest PDF to avoid truncation.
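As a rough way to pick that value, here is a minimal Python sketch that
finds the largest PDF under a directory and adds some headroom. The function
names and the 25% margin are my own assumptions for illustration, not
anything that ships with htdig.

```python
import os


def largest_pdf_size(root):
    """Return the size in bytes of the largest .pdf file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if name.lower().endswith(".pdf"):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest


def suggested_max_doc_size(root, headroom=1.25):
    """Largest PDF size scaled by an arbitrary safety margin (assumed 25%)."""
    return int(largest_pdf_size(root) * headroom)
```

The number it returns would then go into your config file as the value of
max_doc_size, on a line of the form "max_doc_size: <value>".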

Secondly, the error message above seems to come from the external parser
script. For some reason, your PDF does not have a magic number that the
script recognises, so it's rejecting it. Try running pdftotext on it
directly, to see if pdftotext can handle it. If that works, there's a
discrepancy between what pdftotext and the script recognise as a valid
PDF, and I'd probably need a sample of such a PDF to fix the problem
in the script. If pdftotext can't handle K00117.pdf directly, you're
not going to be able to index it in any case -- not with this external
parser anyway. In that case, you'll need to find out what is wrong with
the PDF itself.
If acroread can handle the PDF, and pdftotext can't, I guess it's Derek
Noonburg's problem. Is the PDF encrypted or encoded in some way or other?
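For a quick look at the file before involving pdftotext, something like the
following Python sketch may help. It assumes the usual PDF conventions: a
"%PDF-" header near the start of the file and a "%%EOF" marker near the end.
I'm not claiming this is the exact test the parser script performs; the
function names and the 1024-byte window are illustrative assumptions.

```python
import os


def looks_like_pdf(path, window=1024):
    """True if the '%PDF-' header appears within the first `window` bytes."""
    with open(path, "rb") as f:
        head = f.read(window)
    return b"%PDF-" in head


def has_eof_marker(path, window=1024):
    """True if '%%EOF' appears within the last `window` bytes.

    A file cut off at max_doc_size would normally be missing this marker.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - window))
        tail = f.read()
    return b"%%EOF" in tail
```

If looks_like_pdf() is false, the magic number really is missing or displaced.
If it is true but has_eof_marker() is false, truncation is the more likely
culprit.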


P.S. I'll be away tomorrow, so I probably won't get to look into this
further until Monday.

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
