[htdig] External Parser/Converter Ignored?


Subject: [htdig] External Parser/Converter Ignored?
From: Jochen.Munz@sued-data.de
Date: Thu Dec 16 1999 - 01:04:07 PST


Hi there,

After reading the FAQ etc., I still can't get PDF indexing to work.
I use the "parse_doc.pl" parser, located in /opt/htdig/bin. The Perl script is
correctly configured, and "pdftotext" and "pdfinfo" are in place.

My config file looks like this:
external_parsers: application/pdf /opt/htdig/bin/parse_doc.pl
max_doc_size: 2000000 #just to be sure

When I run "rundig -vvv" I get the following:

Header line: HTTP/1.1 200 OK
Header line: Server: Netscape-Enterprise/3.6 SP2
Header line: Date: Thu, 16 Dec 1999 08:57:11 GMT
Header line: Content-type: application/pdf
Header line: Last-modified: Mon, 29 Nov 1999 18:02:17 GMT
Translated Mon, 29 Nov 1999 18:02:17 GMT to 29 Nov 1999 18:02:17 (99)
And converted to Mon, 29 Nov 1999 18:02:17
Header line: Content-length: 26620
Header line: Accept-ranges: bytes
Header line: Connection: close
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 2044 from document
Read a total of 26620 bytes
 size = 26620

So the PDF is served and read in completely, but the external parser is not
triggered. I even added a simple "touch /var/tmp/dummyfile" to the beginning of
the Perl script. When I start the script from the shell, the file is touched,
but not when htdig runs.
This leaves me with a PDF that is not indexed:
(htmerge) Deleted, no excerpt: 2/http://myserver/pdf/online.pdf
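
A more detailed version of the "touch" test could look like the sketch below. It is only a debugging aid, not a known fix: a small wrapper that would be listed in "external_parsers" in place of parse_doc.pl, records every call in a log file, and then hands the same arguments on to the real parser. The names LOG, PARSER and run_logged, and the log file path, are made up for this example; as far as I understand the documentation, htdig passes four arguments to an external parser (temporary file, content type, URL, config file), and the wrapper simply forwards whatever it receives.

```shell
#!/bin/sh
# Hypothetical debugging wrapper, not part of ht://Dig.
# LOG and PARSER are assumed names and paths; adjust them to the local setup.
LOG="${LOG:-/var/tmp/parser_wrapper.log}"
PARSER="${PARSER:-/opt/htdig/bin/parse_doc.pl}"

run_logged() {
    # Record when, as which user, and with which arguments we were called.
    printf '%s uid=%s args=%s\n' "$(date)" "$(id -u)" "$*" >> "$LOG"
    # Run the real parser with the same arguments and keep its exit status.
    "$PARSER" "$@"
    status=$?
    printf 'exit status: %s\n' "$status" >> "$LOG"
    return "$status"
}

# Forward the arguments only when the wrapper is actually called with some.
if [ "$#" -gt 0 ]; then
    run_logged "$@"
    exit $?
fi
```

If the log file stays empty after a "rundig -vvv" run, the wrapper was never started at all; if it has entries, the logged user ID and exit status show under which account and with what result the parser ran.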

If I remove the "external_parsers" line, the internal PDF parser is triggered,
so the content type "application/pdf" seems to be recognized.

Any help would be greatly appreciated.

jochen
