Subject: Re: [htdig] parsing PDF with NT
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Tue Feb 29 2000 - 10:01:36 PST
According to Stéphane Baudet:
> I successfully compiled HTDig 3.1.4 with Cygwin-B20.1 under Windows NT 4,
First of all, you may want to upgrade to 3.1.5 for the security fixes.
> and it works great for simple HTML files. But I need to index PDF files and
> Adobe Acroread doesn't provide any parsing function under NT. I also tried
> xPdf package but maybe there is something I didn't understand about the
> configuration file of HTDIG.
> I put the following line in htdig.conf :
> external_parsers: application/pdf->plain/text /opt/www/htdig/bin/pdftotext.exe
That should be text/plain, not plain/text. Also, I don't think you can call
pdftotext directly as an external converter, as the arguments won't be right.
You'll probably need a wrapper script. If you have Perl on your NT box,
> I also tried with Aladdin Ghostscript 6.0 and :
> pdf_parser: /opt/www/htdig/bin/pdf2ps.bat
> where pdf2ps.bat is the script provided with Ghostscript.
No, pdf_parser only works with acroread for Unix/Linux, and it's -toPostScript
> But nothing works ! I'd really like to use xpdf, but there is always a
> syntax error about the PDF input file which is in /tmp, like htdig didn't
> get it correctly and broke it !
Could be because of max_doc_size, but it could also be because pdftotext.exe
doesn't like the arguments it's being fed.
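To illustrate the argument problem, here is a minimal sketch of a wrapper, written in Python purely as an example (a Perl or batch equivalent would do the same job). It assumes htdig calls an external converter with the temporary input file as the first argument, followed by the content type, the URL and the config file name, and that pdftotext accepts "-" as the output name to write to standard output. Check both assumptions against the documentation for your versions. The names PDFTOTEXT and build_command are made up for this example.

```python
import subprocess
import sys

# Assumption: pdftotext (from the xpdf package) is on PATH.
# Replace with the full path to pdftotext.exe if it is not.
PDFTOTEXT = "pdftotext"


def build_command(argv):
    """Turn the arguments handed to the wrapper into a pdftotext command.

    Assumed calling convention: argv[1] is the temporary input file;
    the remaining arguments (content type, URL, config file) mean
    nothing to pdftotext, so they are dropped.
    """
    if len(argv) < 2:
        raise ValueError("expected the input file as the first argument")
    # "-" as the output name asks pdftotext to write to standard output.
    return [PDFTOTEXT, argv[1], "-"]


def main(argv):
    try:
        cmd = build_command(argv)
    except ValueError as err:
        sys.stderr.write("pdfwrap: %s\n" % err)
        return 2
    # The converted text goes straight to the wrapper's standard output.
    return subprocess.call(cmd)


# Only run the conversion when at least one argument was passed.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The external_parsers line would then name the wrapper instead of pdftotext.exe, with text/plain as the output type. If the syntax errors persist even with the right arguments, the temporary file may be getting cut off at max_doc_size, so raising that attribute above the size of your largest PDF is worth trying.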
> So, if anybody already had success in indexing PDF under NT, please tell me
> how !!
> Thank you !
I don't use NT, but I've heard that some have successfully used Perl-based
parsers on it.
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930