[htdig] Problems with searching PDF

Tom Wooldridge (tomw@futuresouth.com)
Fri, 25 Jun 1999 09:17:52 -0500


We are having a problem with htdig on our web server. We have a
great deal of PDF content available. We raised the max_document_size
variable to 500k. I also added the external_parsers line and
configured the parse_doc.pl script to parse PDF files. Here is the output
I get after running htdig:

intranet01:/opt/www/htdig/bin # ./htdig -vvvv -i
<output trimmed>

---- notice that the URL is rejected here ---
url rejected: (level 1)file://tc/vol1/ol_prod/mainmenu.pdf
word: requires@676
word: acrobat@681
word: reader@684
Tag: /p>, matched -1
Tag: p>, matched -1
Tag: img src="image15.gif" width="16" height="17">, matched 18
image: http://intranet01/departments/mortgageloan/image15.gif
Tag: font
face="Arial">, matched -1
Tag: /font>, matched -1
Tag: a href="branch/tclist.htm">, matched 2
A tag: pos = 2, position = ="branch/tclist.htm">
word: list@732
word: branches@736
word: thin@740
word: client@742
Tag: /a>, matched 3
href: http://intranet01/departments/mortgageloan/branch/tclist.htm (List of Branches on Thin Client)
resolving 'http://intranet01/departments/mortgageloan/branch/tclist.htm'

This pattern continues for every PDF file that the search engine encounters.
I cannot get any more detailed debugging output, so I am unable to
investigate further.
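
A minimal sketch for pulling just the rejections out of the verbose
output, assuming each one appears on a line of the form
"url rejected: (level N)URL" as in the excerpt above. The function names
and the sample text are illustrative only and are not part of htdig.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches lines such as:
#   url rejected: (level 1)file://tc/vol1/ol_prod/mainmenu.pdf
# The line format is assumed from the excerpt in this message.
REJECTED = re.compile(r"^url rejected: \(level (\d+)\)(\S+)")


def rejected_urls(log_text):
    """Return (level, url) pairs for every 'url rejected' line in the log."""
    pairs = []
    for line in log_text.splitlines():
        match = REJECTED.match(line.strip())
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


def count_by_scheme(pairs):
    """Count rejected URLs per scheme (file, http, ...)."""
    return Counter(urlparse(url).scheme for _, url in pairs)


# Hypothetical sample built from lines quoted in this message.
SAMPLE = """\
url rejected: (level 1)file://tc/vol1/ol_prod/mainmenu.pdf
word: requires@676
word: acrobat@681
"""

if __name__ == "__main__":
    pairs = rejected_urls(SAMPLE)
    for level, url in pairs:
        print(level, url)
    print(dict(count_by_scheme(pairs)))
```

Running this over the saved output of a full run shows at a glance whether
the rejected URLs share something in common, such as the file:// scheme
seen in the one rejection quoted above.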

Any help is greatly appreciated.
Tom Wooldridge

