Subject: Re: [htdig] pdf indexing question
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Tue Jul 25 2000 - 09:19:34 PDT
According to Matthew R. MacIntyre:
> I'm having a problem indexing pdf files. The htdig phase seems to work
> fine, no errors are produced, but when the htmerge phase is run, this error
> always shows up:
> Deleted, no excerpt: 17/http://svr-newlix/products/technical/faq.pdf
> I'm not really sure how to go about fixing this problem. Here's what I have
> in my configuration file:
> external_parsers: application/msword->text/html /usr/local/htdig/bin/conv_doc.pl \
> application/postscript->text/html /usr/local/htdig/bin/conv_doc.pl \
> application/pdf->text/html /usr/local/htdig/bin/conv_doc.pl
> I was trying to use the parse_doc.pl script instead of the conv_doc.pl
> script for a little while, but I kept getting many errors about acroread not
> showing up, and how the pdf files could not be repaired.
Looks like you're dealing with a few separate problems here.
Errors about acroread not being found shouldn't happen if you properly
configure an external parser or converter for application/pdf, so you
had a configuration error somewhere when trying to use parse_doc.pl.
As long as you're running 3.1.4 or later, you should use conv_doc.pl or
doc2html.pl, rather than parse_doc.pl -- they just work better.
Also, errors about PDF files that couldn't be repaired would come from
acroread as well. These are caused by max_doc_size not being set high
enough for your largest PDF documents. See FAQ 5.1 & 5.2.
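To illustrate the max_doc_size point, here is a minimal sketch that finds the largest PDF under a directory and compares it with the configured limit. It assumes the PDFs are also reachable on a local filesystem path and that max_doc_size is a byte count; the function names and the example path in the note below are hypothetical, not part of ht://Dig.

```python
import os


def largest_file(root, suffix=".pdf"):
    """Return (size_in_bytes, path) of the largest matching file under root.

    Returns (0, None) when no non-empty matching file is found.
    """
    best = (0, None)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > best[0]:
                    best = (size, path)
    return best


def exceeds_limit(root, max_doc_size):
    """True if the largest PDF under root is bigger than max_doc_size bytes."""
    size, _path = largest_file(root)
    return size > max_doc_size
```

Calling exceeds_limit("/path/to/docroot", 100000) with your own document root and your own max_doc_size value would show whether any PDF is being cut off before it reaches the converter.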
Finally, you should run /usr/local/htdig/bin/conv_doc.pl, and perhaps
pdftotext, manually on your products/technical/faq.pdf document to
see what output you get, if any. It may be that the PDF contains only
image data, and no indexable text, or it may be that conv_doc.pl isn't
configured with the right path to the pdftotext executable.
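The manual pdftotext check can be scripted roughly as follows. This is only a sketch: it assumes pdftotext is on the PATH (or that you pass its full path) and that giving "-" as the output file sends the text to standard output. The function names and the two-character word pattern are illustrative choices, not anything conv_doc.pl does itself.

```python
import re
import subprocess


def extract_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on pdf_path and return whatever text it writes to stdout.

    Raises FileNotFoundError if the pdftotext executable cannot be found,
    and CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(
        [pdftotext, pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_indexable_text(text, min_words=1):
    """True if text contains at least min_words runs of 2+ letters or digits."""
    words = re.findall(r"[A-Za-z0-9]{2,}", text)
    return len(words) >= min_words
```

If has_indexable_text(extract_text("faq.pdf")) comes back False, the PDF probably holds only image data; a FileNotFoundError points instead to a wrong pdftotext path.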
I'm assuming the first two lines of your external_parsers definition
above were split up by your mail program (I rejoined them above), and
they aren't split in your configuration file. A backslash is required
at the very end of all but the last line in a multi-line definition.
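The backslash rule is easy to check mechanically. The sketch below takes the lines of one multi-line definition, already picked out of the configuration file by hand, and reports which of them lack a backslash at the very end. It is an illustration only, not ht://Dig's own parser, and the function name is made up.

```python
def missing_continuations(definition_lines):
    """Return 1-based positions of lines that need a trailing backslash but lack one.

    Every line except the last must end with a backslash. Only the line
    terminator is stripped, so a backslash followed by spaces is reported too.
    """
    problems = []
    for pos, line in enumerate(definition_lines[:-1], start=1):
        if not line.rstrip("\r\n").endswith("\\"):
            problems.append(pos)
    return problems
```

An empty list means every line but the last ends in a backslash.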
If you can make sure that your external_parsers definition is correct,
that max_doc_size is big enough for your PDFs, that running conv_doc.pl
on your PDFs does produce indexable text, and that the PDFs are not
disallowed by your robots.txt file, then you shouldn't get the "no excerpt"
error.
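The robots.txt condition can be tested offline with Python's standard urllib.robotparser module. This is a sketch under two assumptions: that you paste in the lines of your server's robots.txt yourself, and that the crawler identifies itself as "htdig" (check your robotstxt_name setting rather than trusting this default).

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, url, agent="htdig"):
    """True if the given robots.txt lines allow agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)
```

For example, with the lines "User-agent: *" and "Disallow: /products/technical/", the faq.pdf URL from the error message would be reported as not allowed.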
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930