Re: [htdig] pdf indexing question

Subject: Re: [htdig] pdf indexing question
From: Gilles Detillieux
Date: Tue Jul 25 2000 - 09:19:34 PDT

According to Matthew R. MacIntyre:
> I'm having a problem indexing pdf files. The htdig phase seems to work
> fine, no errors are produced, but when the htmerge phase is run, this error
> always shows up:
> Deleted, no excerpt: 17/http://svr-newlix/products/technical/faq.pdf
> I'm not really sure how to go about fixing this problem. Here's what I have
> in my configuration file:
> external_parsers: application/msword->text/html /usr/local/htdig/bin/ \
> application/postscript->text/html /usr/local/htdig/bin/ \
> application/pdf->text/html /usr/local/htdig/bin/
> I was trying to use the script instead of the
> script for a little while, but I kept getting many errors about acroread not
> showing up, and how the pdf files could not be repaired.

Looks like you're dealing with a few separate problems here.

Errors about acroread not being found shouldn't happen if you properly
configure an external parser or converter for application/pdf, so you
had a configuration error somewhere when trying to use
As long as you're running 3.1.4 or later, you should use or, rather than -- they just work better.

Also, the errors about PDF files that couldn't be repaired come from
acroread as well. They happen when max_doc_size is not set high enough
for your largest PDF documents: htdig cuts the download off at that
limit, so the converter is handed an incomplete file. See FAQ 5.1 & 5.2.
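
To pick a sensible max_doc_size, it helps to know how big your largest
PDF actually is. Here is a minimal Python sketch, assuming the PDFs are
also readable on the local filesystem under some document root (the
function name and the directory you pass in are only illustrative):

```python
import os
import sys


def largest_pdf(root):
    """Return (size_in_bytes, path) of the largest .pdf under root, or (0, None)."""
    best = (0, None)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > best[0]:
                    best = (size, path)
    return best


if __name__ == "__main__":
    # The document root is an assumption; pass your own as the first argument.
    size, path = largest_pdf(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(size, path)
```

max_doc_size is given in bytes, so it should be set comfortably above
the size this reports.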

Finally, you should run /usr/local/htdig/bin/, and perhaps
pdftotext, manually on your products/technical/faq.pdf document to
see what output you get, if any. It may be that the PDF contains only
image data, and no indexable text, or it may be that isn't
configured with the right path to the pdftotext executable.
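
If you would rather script that check, something along these lines
should do. It is only a sketch: it assumes a pdftotext binary (from xpdf)
is on your PATH and that it accepts "-" as the output file to write to
stdout. The helper names and the two-character word rule are my own
choices, not anything htdig does.

```python
import re
import subprocess
import sys


def count_words(text):
    """Count runs of two or more letters or digits, a rough proxy for indexable words."""
    return len(re.findall(r"[A-Za-z0-9]{2,}", text))


def pdf_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on pdf_path and return whatever text it writes to stdout."""
    result = subprocess.run(
        [pdftotext, pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("latin-1", errors="replace")


if __name__ == "__main__":
    words = count_words(pdf_text(sys.argv[1]))
    print(words, "words found")
```

A count of zero on a PDF that opens fine in a viewer suggests the file
holds only scanned images.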

I'm assuming the first two lines of your external_parsers definition
above were split up by your mail program (I rejoined them above), and
they aren't split in your configuration file. A backslash is required
at the very end of all but the last line in a multi-line definition.
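
To see what a missing backslash does to a multi-line definition, here is
a rough Python model of the continuation rule. It is not htdig's actual
parser, just an illustration, and /path/to/converter is a placeholder
rather than a real script name.

```python
def join_continuations(text):
    """Join physical lines that end in a backslash into logical lines."""
    logical, pending = [], ""
    for line in text.splitlines():
        if line.endswith("\\"):
            # The backslash must be the very last character on the line.
            pending += line[:-1].strip() + " "
        else:
            logical.append((pending + line.strip()).strip())
            pending = ""
    if pending:
        logical.append(pending.strip())
    return [entry for entry in logical if entry]


if __name__ == "__main__":
    config = (
        "external_parsers: application/msword->text/html /path/to/converter \\\n"
        "    application/pdf->text/html /path/to/converter\n"
    )
    for entry in join_continuations(config):
        print(entry)
```

With the backslash in place the two physical lines come out as one
definition; drop it and the application/pdf entry ends up on a line of
its own, outside external_parsers.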

If you can make sure that your external_parsers definition is correct,
that max_doc_size is big enough for your PDFs, that running
on your PDFs does produce indexable text, and that the PDFs are not
disallowed by your robots.txt file, then you shouldn't get the no excerpt
error above.
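
The robots.txt part can be checked quickly with Python's standard
urllib.robotparser. This sketch assumes the robot name is the default
"htdig" (change it if you set robotstxt_name), and the rules shown are
made up for illustration, so paste in your own file's contents.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, agent="htdig"):
    """Return True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical rules; replace with the contents of your own robots.txt.
    rules = "User-agent: *\nDisallow: /products/technical/\n"
    url = "http://svr-newlix/products/technical/faq.pdf"
    print(is_allowed(rules, url))
```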

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

