Re: [htdig] Problem indexing PDF files


Joerg Behrens (behrens@noell.de)
Thu, 08 Jul 1999 19:17:51 +0200


PDF Files, the neverending story......

First, check whether the web server sends the correct Content-Type header for
PDF files. You can check this by downloading a file with your normal browser
and seeing whether it opens in Acrobat Reader. If it does not, add the line
application/pdf pdf
to the server's MIME types file.
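
The header check can also be scripted instead of done through a browser. The
following is only a minimal sketch: the function names are made up for this
example, and the URL in the comment is the one from the quoted error message
below.

```python
import urllib.request


def is_pdf_content_type(header_value):
    """Return True if a Content-Type header value names application/pdf."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=..." and compare case-insensitively.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header (or None)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


# Example (needs network access to your own server):
# url = "http://10.10.12.67/comp/datasheet/K00117.pdf"
# print(is_pdf_content_type(fetch_content_type(url)))
```

If the check reports False, the MIME types entry above is the first thing to
look at.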

Once this works, edit the htdig.conf file and add the following line:
pdf_parser: /usr/local/Acrobat4/bin/acroread -toPostScript -pairs
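
To confirm that the line really ended up in the configuration, a small script
can look it up. This is a rough sketch, not htdig's own parser: it assumes
simple "name: value" lines, treats lines starting with "#" as comments, and
ignores continuation lines. The helper names are hypothetical.

```python
def find_attribute(config_text, name):
    """Return the value of the first 'name: value' line, or None."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, rest = line.partition(":")
        if sep and key.strip() == name:
            return rest.strip()
    return None


def parser_executable(attribute_value):
    """Return the program path, i.e. the first word of the attribute value."""
    parts = attribute_value.split()
    return parts[0] if parts else None


# Example:
# with open("htdig.conf") as handle:
#     value = find_attribute(handle.read(), "pdf_parser")
# print(value, parser_executable(value) if value else None)
```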

I use Acrobat Reader (version 4.0) rather than the xpdf tools to parse PDF
documents with htdig. You can download the software from www.adobe.com. Before
digging again, test this at the command line:
/path_to/acroread -toPostScript input_file.pdf
You should get a file named input_file.ps.
When running htdig, make sure that you have enough disk space in /tmp.
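
The command-line test and the disk space check can be combined in a short
script. This is a sketch under assumptions: it uses the single-dash
-toPostScript option as in the pdf_parser line, it assumes the .ps file is
written next to the input file, and the helper names are hypothetical. How much
free space counts as "enough" depends on your documents, so no threshold is
built in.

```python
import os
import shutil
import subprocess


def build_command(acroread_path, pdf_path):
    """Build the argument list for the conversion test."""
    return [acroread_path, "-toPostScript", pdf_path]


def expected_ps_path(pdf_path):
    """Name of the PostScript file expected for a given PDF file."""
    base, _extension = os.path.splitext(pdf_path)
    return base + ".ps"


def free_megabytes(path):
    """Free space, in whole megabytes, on the filesystem holding path."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def conversion_works(acroread_path, pdf_path):
    """Run acroread and report whether the expected .ps file now exists."""
    subprocess.run(build_command(acroread_path, pdf_path), check=False)
    return os.path.exists(expected_ps_path(pdf_path))


# Example:
# print(free_megabytes("/tmp"), "MB free in /tmp")
# print(conversion_works("/path_to/acroread", "input_file.pdf"))
```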

"Joakim Wiberg (HMS)" wrote:

> Hello,
>
> I am trying to index a PDF file and I get the following error:
>
> Read 8192 from document
> Read 1696 from document
> Read a total of 100000 bytes
> Can't determine type of file /usr/local/htdig/db/htdext.16478; content-type:
> application/pdf; URL: http://10.10.12.67/comp/datasheet/K00117.pdf
>
> I can get htdig to index ordinary HTML pages, but when I try to index PDF
> files this problem arises.
>
> /Joakim
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig@htdig.org containing the single word "unsubscribe" in
> the SUBJECT of the message.

--
Key fingerprint =  92 7D E0 A6 CF AE EC 32  14 28 EF 0D 57 2A 88 5B
----------------------------------------------------------------------
Preussag Noell Dienstleistungen
D-97080 Wuerzburg
Alfred-Nobel-Straße 20                         Tel: +49 931 903-2243
Abt: DV-C/tr                                   Fax: +49 511 903-2051




This archive was generated by hypermail 2.0b3 on Thu Jul 08 1999 - 09:34:35 PDT