Subject: Re: [htdig] PDF problems
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Wed Jan 03 2001 - 07:18:56 PST
According to The Melia Family:
> I am using HTDIG 3.1.5 on Redhat 7.0, and am having problems indexing PDF
> files. I have included my config & -vv output below. I have no robots.txt
> file, and my max_doc_size is now 10M (one test .pdf file is only 27K and it
> also fails), as well as not rejecting pdf as an extension.
> I am using the latest xpdf with pdftotext, as well as the latest parse_doc
> and conv_doc scripts.
> I can manually pdftotext the pdf files and they do contain real text, not
> just images. I can also run parse_doc and conv_doc.pl; they produce proper
> text. When I do a rundig, I get a 'URL rejected' message and I do not know
> why. This (I presume) leads to a Deleted No Excerpt message and the file (or
> any pdf file) is not indexed. Any suggestions?
The output from htdig isn't verbose enough to pinpoint the problems,
but there is more than one problem here. First of all, I always strongly
recommend conv_doc.pl or doc2html.pl over parse_doc.pl. The latter has
been the source of too many problems in the past.
Secondly, the rejected URLs and the "Deleted, no excerpt:" messages
are two unrelated issues. URLs that are rejected by htdig at this
stage (level 1 or level 2) will not even be seen by htmerge. See
http://www.htdig.org/FAQ.html#q5.27 for how to deal with rejected URLs.
There isn't enough information in the htdig output or the excerpts of
your htdig.conf to be certain of the reason for the rejection. However,
the htdig output you sent seems to suggest a different start_url value
than the one in your htdig.conf excerpt, so I suspect the rejected URL
is the parent directory of the one you're indexing, which falls outside
limit_urls_to. That is reasonable behaviour for a test case such as this.
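
To picture why a parent-directory link would be rejected, here is a
minimal Python sketch of the kind of check involved. It is not htdig's
code. It assumes limit_urls_to is a list of plain substrings that a URL
must contain and that it defaults to start_url; the host name and paths
are made up for illustration and are not taken from the poster's config.

```python
# Hypothetical values, not taken from the poster's htdig.conf.
START_URL = "http://www.example.com/docs/pdf/"
LIMIT_URLS_TO = [START_URL]  # assumed: limit_urls_to defaults to start_url


def within_limits(url, limits):
    """Return True if the URL contains at least one limit pattern.

    Assumption: patterns are matched as plain substrings.
    """
    return any(pattern in url for pattern in limits)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/docs/pdf/test.pdf",  # inside the limits
        "http://www.example.com/docs/",              # parent directory
    ):
        verdict = "accepted" if within_limits(url, LIMIT_URLS_TO) else "rejected"
        print(f"{verdict}: {url}")
```

Under those assumptions the PDF itself passes the check and only the
"Parent Directory" link from the directory listing is turned away, which
would be harmless.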
The "Deleted, no excerpt:" messages are usually the result of documents
that contain no indexable text, or external parsers that don't emit a
usable "h" record (one more reason to use an external converter rather
than an external parser). The challenge is to get to the bottom of why
this happens in each individual case. You did run the scripts manually,
which is what I usually recommend, but are you sure parse_doc.pl put out
a proper "h" record and not just "w" records? Did you try htdig with
conv_doc.pl instead, using the correct syntax for external_parsers as
shown in conv_doc.pl's comments?
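
If you saved the parser's output when running it by hand, a quick way to
check it for "h" records is something like the Python sketch below. It
assumes the external parser output is one record per line, with
tab-separated fields and a one-letter type code ("t", "h", "w", and so
on) in the first field. The function names and the sample data are made
up for illustration.

```python
from collections import Counter


def count_record_types(parser_output):
    """Count external-parser records by their leading type field.

    Assumption: each record is one line of tab-separated fields whose
    first field is a type code such as "t", "h" or "w".
    """
    counts = Counter()
    for line in parser_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def has_excerpt(parser_output):
    """Return True if at least one non-empty "h" record is present."""
    for line in parser_output.splitlines():
        fields = line.split("\t")
        if fields[0] == "h" and len(fields) > 1 and fields[1].strip():
            return True
    return False


if __name__ == "__main__":
    # Made-up sample: a title and two words, but no "h" record.
    sample = "t\tTest document\nw\thello\t0\t0\nw\tworld\t500\t0\n"
    print(dict(count_record_types(sample)))
    print("has excerpt:", has_excerpt(sample))
```

Output that has plenty of "w" records but fails has_excerpt() would be
consistent with the "Deleted, no excerpt:" symptom.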
Finally, I noticed you're getting the directory indexed multiple times
due to Apache's fancy indexing feature. You can avoid this by adding
"?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to exclude_urls (without the
quotes) to suppress the alternately sorted views of the directory.
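
To see what those patterns would catch, here is a small Python sketch. It
assumes exclude_urls patterns are matched as plain substrings of the URL
and that "/cgi-bin/ .cgi" is the stock value you would be appending to.
The example URLs are made up.

```python
# Assumed: "/cgi-bin/ .cgi" is the stock exclude_urls value; the
# sort-order patterns are the ones suggested above.
EXCLUDE_URLS = "/cgi-bin/ .cgi ?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D".split()


def is_excluded(url, patterns=EXCLUDE_URLS):
    """Return True if the URL contains any exclude pattern (substring match)."""
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/docs/pdf/",          # plain directory listing
        "http://www.example.com/docs/pdf/?N=D",      # re-sorted by name
        "http://www.example.com/docs/pdf/?M=A",      # re-sorted by date
        "http://www.example.com/docs/pdf/test.pdf",  # the document itself
    ):
        verdict = "excluded" if is_excluded(url) else "kept"
        print(f"{verdict}: {url}")
```

Only the re-sorted views are dropped; the plain listing and the documents
it links to are still followed.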
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930