Re:[htdig] parsing PDF with NT


Subject: Re:[htdig] parsing PDF with NT
From: Stéphane Baudet (sbaudet@araxe.fr)
Date: Wed Mar 01 2000 - 08:07:36 PST


Thanks for your reply. I upgraded to 3.1.5, but I still have problems
parsing PDF files. I found that the temporary files retrieved by HtDig are
slightly larger than the original PDF files. I managed to keep one and tried
to open it with Acrobat Reader: the pages remain blank, so the file must be
corrupted.
For example, I have a PDF whose size is 90076 bytes, and HtDig retrieves a
temporary file in /tmp whose size is 90386 bytes!
Any ideas?

Stephane Baudet.
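
The two sizes quoted above differ by 90386 - 90076 = 310 bytes. One possible explanation, which is only a hypothesis and is not confirmed anywhere in this thread, is that the temporary file is written in text mode, so that each LF byte becomes CR LF. A minimal Python sketch for testing that hypothesis on a kept copy of both files follows; the function names are illustrative and not part of ht://Dig.

```python
def expected_growth(original: bytes) -> int:
    """Bytes added if every LF in the data were rewritten as CR LF."""
    return original.count(b"\n")


def matches_text_mode_translation(original: bytes, retrieved: bytes) -> bool:
    """True if 'retrieved' is exactly 'original' with each LF turned into CR LF."""
    return original.replace(b"\n", b"\r\n") == retrieved


def read_bytes(path: str) -> bytes:
    # Always read in binary mode so Python itself does not translate anything.
    with open(path, "rb") as handle:
        return handle.read()
```

If `expected_growth(read_bytes(original_path))` comes out at 310 for the file in question and `matches_text_mode_translation` returns true, line-ending translation would be a plausible cause; if not, the corruption comes from somewhere else.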

-----Original Message-----
From: Gilles Detillieux [mailto:grdetil@scrc.umanitoba.ca]
Sent: Tuesday, February 29, 2000 7:02 PM
To: Stéphane Baudet
Cc: htdig@htdig.org
Subject: Re: [htdig] parsing PDF with NT

According to Stéphane Baudet:
> I successfully compiled HTDig 3.1.4 with Cygwin-B20.1 under Windows NT 4,

First of all, you may want to upgrade to 3.1.5 for the security fixes.

> and it works great for simple HTML files. But I need to index PDF files,
> and Adobe Acroread doesn't provide any parsing function under NT. I also
> tried the xpdf package, but maybe there is something I didn't understand
> about the configuration file of HTDIG.
> I put the following line in htdig.conf :
>
> external_parsers: application/pdf->plain/text /opt/www/htdig/bin/pdftotext.exe

That should be text/plain, not plain/text. Also, I don't think you can call
pdftotext directly as an external converter, as the arguments won't be right.
You'll probably need a wrapper script. If you have Perl on your NT box,
try contrib/conv_doc.pl.
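
As an illustration of what such a wrapper does, here is a minimal Python sketch (a stand-in for the Perl script suggested above, not the contents of contrib/conv_doc.pl). It assumes that ht://Dig passes the temporary file name as the first argument, followed by the content type, the URL, and the configuration file, and that pdftotext writes to standard output when the output file is given as "-". The path is taken from the configuration line quoted above; the function names are hypothetical.

```python
import subprocess
import sys

# Path from the quoted configuration line; adjust for your installation.
PDFTOTEXT = "/opt/www/htdig/bin/pdftotext.exe"


def build_command(argv):
    """Build the pdftotext command line from the wrapper's arguments.

    argv is assumed to be: input file, content type, URL, config file.
    Only the input file is forwarded; the rest would confuse pdftotext.
    """
    if len(argv) < 1:
        raise ValueError("expected at least the input file name")
    infile = argv[0]
    # "-" as the output file asks pdftotext to write the text to stdout.
    return [PDFTOTEXT, infile, "-"]


def main(argv):
    """Run pdftotext and relay its output as bytes, untranslated."""
    result = subprocess.run(build_command(argv), stdout=subprocess.PIPE)
    sys.stdout.buffer.write(result.stdout)
    return result.returncode
```

A script entry point would call `sys.exit(main(sys.argv[1:]))`, and the configuration line would then name the wrapper, with `text/plain` as the target type, instead of pdftotext.exe itself.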

> I also tried with Aladdin Ghostscript 6.0 and :
>
> pdf_parser: /opt/www/htdig/bin/pdf2ps.bat
>
> where pdf2ps.bat is the script provided with Ghostscript.

No, pdf_parser only works with acroread for Unix/Linux, and its
-toPostScript option.

> But nothing works! I'd really like to use xpdf, but there is always a
> syntax error about the PDF input file, which is in /tmp, as if htdig hadn't
> retrieved it correctly and had broken it!

Could be because of max_doc_size, but it could also be because pdftotext.exe
doesn't like the arguments it's being fed.

> So, if anybody has already had success in indexing PDF under NT, please
> tell me how!
> Thank you!

I don't use NT, but I've heard that some have successfully used Perl-based
parsers on it.

See also
http://www.htdig.org/FAQ.html#q5.2
http://www.htdig.org/FAQ.html#q4.9
http://www.htdig.org/mail/1999/07/0164.html
http://www.htdig.org/mail/1999/11/0329.html

--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------ To unsubscribe from the htdig mailing list, send a message to htdig-unsubscribe@htdig.org You will receive a message to confirm this.




This archive was generated by hypermail 2b28 : Wed Mar 01 2000 - 08:20:20 PST