Re: [htdig] Problems with searching PDF

Torsten Neuer
Fri, 25 Jun 1999 17:15:13 +0200

According to Tom Wooldridge:
> We are having a bit of a problem on our webserver. We have a
>great deal of PDF content available. We raised the max_document_size
>variable to 500k. I, of course, added the external parsers line and
>configured the script to parse PDF files. Here is the output
>I get after running htdig.
>intranet01:/opt/www/htdig/bin # ./htdig -vvvv -i
><output trimmed>
>---- notice that the URL is rejected here ---
>url rejected: (level 1)file://tc/vol1/ol_prod/mainmenu.pdf
>word: requires@676
>word: acrobat@681
>word: reader@684
>Tag: /p>, matched -1
>Tag: p>, matched -1
>Tag: img src="image15.gif" width="16" height="17">, matched 18
>image: http://intranet01/departments/mortgageloan/image15.gif
>Tag: font
>face="Arial">, matched -1
>Tag: /font>, matched -1
>Tag: a href="branch/tclist.htm">, matched 2
>A tag: pos = 2, position = ="branch/tclist.htm">
>word: list@732
>word: branches@736
>word: thin@740
>word: client@742
>Tag: /a>, matched 3
>href: http://intranet01/departments/mortgageloan/branch/tclist.htm (List
>of Branches on Thin Client)
>resolving 'http://intranet01/departments/mortgageloan/branch/tclist.htm'
>This pattern continues for every PDF file that the search engine encounters.
>I am unable to get any further debugging output, so I cannot
>investigate further.

For all PDFs, or only for PDFs with a file:// URL?

AFAIK ht://Dig doesn't work with file:// URLs (which surely makes sense),
so the "url rejected" line above would be expected for that PDF.
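
One way to answer the question above from the htdig -vvvv output is to count
the rejected PDF URLs per scheme. The following is only a rough sketch in
Python, not part of ht://Dig: the function names are made up for illustration,
and the regular expression assumes that rejection lines always look like the
single "url rejected: (level 1)..." line in the quoted output.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed line format, taken from the one example in the quoted output:
#   url rejected: (level 1)file://tc/vol1/ol_prod/mainmenu.pdf
# The optional leading ">" allows for output pasted from a quoted mail.
REJECTED = re.compile(r"^>?\s*url rejected:\s*\(level \d+\)\s*(\S+)")


def rejected_urls(lines):
    """Return the URLs found on 'url rejected' lines (hypothetical helper)."""
    urls = []
    for line in lines:
        match = REJECTED.match(line)
        if match:
            urls.append(match.group(1))
    return urls


def rejected_pdfs_by_scheme(lines):
    """Count rejected URLs whose path ends in .pdf, grouped by URL scheme."""
    counts = Counter()
    for url in rejected_urls(lines):
        parsed = urlparse(url)
        if parsed.path.lower().endswith(".pdf"):
            counts[parsed.scheme] += 1
    return counts


# Sample lines copied from the quoted debug output.
sample = [
    "url rejected: (level 1)file://tc/vol1/ol_prod/mainmenu.pdf",
    "word: requires@676",
    "resolving 'http://intranet01/departments/mortgageloan/branch/tclist.htm'",
]

if __name__ == "__main__":
    for scheme, count in rejected_pdfs_by_scheme(sample).items():
        print(scheme, count)
```

If only "file" shows up in the counts while PDFs reached over http:// are
indexed, the problem is the URL scheme rather than the PDF parser setup.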


InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606


This archive was generated by hypermail 2.0b3 on Fri Jun 25 1999 - 07:35:46 PDT