RE: [htdig] PDF problems


Subject: RE: [htdig] PDF problems
From: The Melia Family (melias@hypermax.net.au)
Date: Wed Jan 03 2001 - 13:22:20 PST


Thank you Gilles. I've gone through your suggestions, and it now seems to
work. Funny thing is, I do not know which config option I changed to fix
it. The manual conversion of a small PDF file is below. I had been
indexing a small section of my site as a test. Now that it seems to be
working, I will reindex the entire site and see if it still works. I am
using the -vvv option in case I need to debug it.

I have noticed that the output from rundig looks the same whether the
database is being newly built or updated. At what point does htdig realise
it already has a page in its database, assuming I am running an update of
the htdig database?

Regards,
Tony.

[root@Linux /root]# conv_doc.pl /var/www/html/htdig/mx59pro/manual/english/content.pdf
<HTML>
<head>
<title>PDF Document </title>
</head>
<body>
Chapter 1 Overview

Chapter 2 Hardware Installation

Chapter 3 Software Installation

Chapter 4 Award BIOS

</body>
</HTML>
[root@Linux /root]#

-----Original Message-----
From: Gilles Detillieux [mailto:grdetil@scrc.umanitoba.ca]
Sent: Thursday, January 04, 2001 1:19 AM
To: The Melia Family
Cc: htdig@htdig.org
Subject: Re: [htdig] PDF problems

According to The Melia Family:
> I am using HTDIG 3.1.5 on Redhat 7.0, and am having problems indexing PDF
> files. I have included my config & -vv output below. I have no robots.txt
> file, and my max_doc_size is now 10M (one test .pdf file is only 27K and it
> also fails), as well as not rejecting pdf as an extension.
> I am using the latest xpdf with pdftotext, as well as the latest parse_doc
> and conv_doc scripts.
>
> I can manually pdftotext the pdf files and they do contain real text, not
> just images. I can also run parse_doc and conv_doc.pl; they produce proper
> text. When I do a rundig, I get a 'URL rejected' message. I do not know
> why. This (I presume) leads to a Deleted No Excerpt message and the file (or
> any pdf file) is not indexed. Any suggestions??

The output from htdig isn't verbose enough to pinpoint the problems,
but there is more than one problem here. First of all, I always strongly
recommend conv_doc.pl or doc2html.pl over parse_doc.pl. The latter has
been the source of too many problems in the past.

Secondly, the rejected URLs and the "Deleted, no excerpt:" messages
are two unrelated issues. URLs that are rejected by htdig at this
stage (level 1 or level 2) will not even be seen by htmerge. For the
rejection of URLs, see http://www.htdig.org/FAQ.html#q5.27 for how to
deal with this. There isn't enough information in the htdig output or
the excerpts of your htdig.conf you sent to be certain of what the reason
for rejection is. However, the htdig output you sent seems to suggest
a different start_url value than the one in your htdig.conf excerpt, so
I suspect that the reason for the rejection is that the parent directory
of the one you're indexing is not in the limits of limit_urls_to, which
is a reasonable thing for a test case such as this.
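
As a rough illustration of that situation (the host and paths here are
hypothetical, not taken from the poster's htdig.conf), a test setup along
these lines would cause links back to the parent directory to be rejected,
since they do not match anything in limit_urls_to:

```
# Hypothetical test configuration: index only one subdirectory.
start_url:      http://www.example.com/manuals/english/

# limit_urls_to defaults to ${start_url}; it is spelled out here for clarity.
# A link to http://www.example.com/manuals/ does not contain this string,
# so htdig would report it as rejected. That is harmless for a test run.
limit_urls_to:  ${start_url}
```

If the rejected URLs are ones you do want indexed, widening limit_urls_to
(or start_url) to cover them is the usual approach.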

The "Deleted, no excerpt:" messages are usually the result of documents
that contain no indexable text, or external parsers that don't emit a
usable "h" record (one more reason to use an external converter rather
than an external parser). The challenge is to get to the bottom of why
this happens in each individual case. You did run the scripts manually,
which is what I usually recommend, but are you sure parse_doc.pl put out
a proper "h" record and not just "w" records? Did you try htdig with
conv_doc.pl instead, using the correct syntax for external_parsers as
shown in conv_doc.pl's comments?
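
For reference, the converter syntax looks roughly like the sketch below.
This is written from memory of the script's header comments, and the
install path /usr/local/bin is an assumption, so check it against your own
copy of conv_doc.pl before using it:

```
# Sketch only: adjust the path to wherever conv_doc.pl is installed.
# The "->text/html" part marks conv_doc.pl as a converter whose HTML
# output is handed to htdig's internal parser, rather than a parser that
# must emit its own word and excerpt records.
external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
                  application/postscript->text/html /usr/local/bin/conv_doc.pl \
                  application/pdf->text/html /usr/local/bin/conv_doc.pl
```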

Finally, I noticed you're getting the directory indexed multiple times
due to Apache's fancy indexing feature. You can avoid this by adding
"?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to exclude_urls (without the
quotes) to suppress the alternately sorted views of the directory.
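
In htdig.conf that would look something like the following. The first two
patterns are what I believe the default exclude_urls value to be, so this
assumes you have not already customised the attribute:

```
# Assumed defaults (/cgi-bin/ .cgi) plus Apache's sorted directory views.
exclude_urls:   /cgi-bin/ .cgi ?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D
```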

--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



