Re: [htdig] External Parser/Converter Ignored?


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Dec 16 1999 - 10:30:05 PST


According to Jochen.Munz@sued-data.de:
> after reading FAQ etc., I still can't get PDF-indexing to work.
> I use the "parse_doc.pl" parser, located in /opt/htdig/bin. The perl script is
> correctly configured and "pdftotext/pdfinfo" are in place.
>
> My config file looks like this:
> external_parsers: application/pdf /opt/htdig/bin/parse_doc.pl
> max_doc_size: 2000000 #just to be sure
>
> When I run "rundig -vvv" I get the following:
[snip]
> So the PDF is served, and read in completely. But the external parser is not
> triggered. I even added a simple "touch /var/tmp/dummyfile" to the beginning of
> the perl-script. Started from the shell, the file is touched - but not when
> htdig runs.
> This leaves me with a not-indexed PDF:
> (htmerge) Deleted, no excerpt: 2/http://myserver/pdf/online.pdf
>
> If I remove the "external_parsers" line the internal PDF-parser is triggered, so
> the content-type "application/pdf" seems to be recognized.
>
> Any help would be greatly appreciated.

Just a hunch, but what is your TMPDIR environment variable set to when
you run htdig? If you don't have write access to that directory, htdig
won't be able to create the temporary file it uses to pass the document
to the parser. Believe it or not, when that happens htdig silently skips
the document instead of parsing it.
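
A quick way to test this hunch is a small shell check, run as the same
user (and from the same cron job or shell) that runs rundig. This is only
a sketch: the function name check_tmpdir is made up for illustration, and
the fallback to /tmp when TMPDIR is unset is an assumption about htdig's
behaviour, not something confirmed here.

```shell
#!/bin/sh
# Report whether the directory used for temporary files is writable.
# Assumption: /tmp is used when TMPDIR is unset or empty.
check_tmpdir() {
    dir="${TMPDIR:-/tmp}"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}

# Print the result for the current environment, with a hint on failure.
check_tmpdir || echo "set TMPDIR to a writable directory before running rundig"
```

If it reports the directory as not writable, pointing TMPDIR at a
directory you own before running rundig should show whether the parser
then gets triggered.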

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



