RE: [htdig] parsing PDF with NT


Subject: RE: [htdig] parsing PDF with NT
From: Stéphane Baudet (sbaudet@araxe.fr)
Date: Wed Mar 01 2000 - 09:23:38 PST


Yes, it works now! I just added "wb" as the second argument to fopen() in
htdig/ExternalParser.cc, then recompiled htdig, and now it works perfectly.
HtDig writes temporary files of the correct size and everything works well.
To parse the PDF files, I used conv_doc.pl with the following line in
htdig.conf:

external_parsers: application/pdf->text/html "d:/perl/bin/perl.exe /opt/www/htdig/bin/conv_doc.pl"

But I think it should also work with parse_doc.pl.

Thank you for your help, you're great ;)
See Ya !

Stephane Baudet.

-----Original Message-----
From: Gilles Detillieux [mailto:grdetil@scrc.umanitoba.ca]
Sent: Wednesday, March 01, 2000 6:01 PM
To: Stéphane Baudet
Cc: htdig@htdig.org
Subject: Re: [htdig] parsing PDF with NT

According to Stéphane Baudet:
> Well, thanks for your reply. I upgraded to 3.1.5, but I still have problems
> parsing PDF files. I found that the temporary files retrieved by HtDig are a
> little bigger than the original PDF files. I managed to keep one and tried to
> open it with Acrobat Reader. The pages remain blank, so the file must be
> corrupted.
> For example, I have a PDF whose size is 90076 bytes, and HtDig retrieves a
> temporary file in /tmp whose size is 90386 bytes!
> Any idea?

Well, I'm going out on a limb here, because I'm really not familiar with
the Cygwin package, but if it makes a distinction between writing to
binary files vs. text files, adding CRs before LFs on text files, then this
could be the problem here. htdig/ExternalParser.cc creates its temporary
file using:

    FILE *fl = fopen(path, "w");

If this causes the Cygwin library to do CR/LF expansion, you'd need to
change this to avoid that problem, e.g. by using "wb" as the second
argument, if that's what it takes, or somehow setting O_BINARY mode on
the file. Have a look at the Cygwin docs, and please let us know if you
find a fix - we'll try to incorporate a portable form of it in future
releases.
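
For what it's worth, the sizes reported above differ by 90386 - 90076 = 310
bytes, which is what you would see if the PDF contained 310 LF bytes and
each one had a CR added in front of it. That is only consistent with the
CR/LF theory, not proof of it. Here is a minimal sketch of the idea in plain
C. It is not the actual htdig code: the helper names write_binary(),
file_size() and count_lf() are made up for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write len bytes to path in binary mode.
 * "wb" asks the C library not to translate LF into CR/LF on platforms
 * that distinguish text streams from binary streams.
 * Returns 0 on success, -1 on error. */
int write_binary(const char *path, const unsigned char *data, size_t len)
{
    FILE *fl = fopen(path, "wb");
    size_t written;

    if (fl == NULL)
        return -1;
    written = fwrite(data, 1, len, fl);
    if (fclose(fl) != 0 || written != len)
        return -1;
    return 0;
}

/* Hypothetical helper: size of a file in bytes, read back in binary
 * mode so that no translation hides the real on-disk size.
 * Returns -1 if the file cannot be opened. */
long file_size(const char *path)
{
    FILE *fl = fopen(path, "rb");
    long n = 0;

    if (fl == NULL)
        return -1;
    while (fgetc(fl) != EOF)
        n++;
    fclose(fl);
    return n;
}

/* Hypothetical helper: count the LF bytes in a buffer. Under text-mode
 * CR/LF expansion, each of them would make the written file one byte
 * bigger than the original data. */
size_t count_lf(const unsigned char *data, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (data[i] == '\n')
            n++;
    return n;
}
```

After write_binary() the value from file_size() should equal the buffer
length on any platform. On systems that make no text/binary distinction,
"w" and "wb" behave the same, so the "b" should be harmless there.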

--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------ To unsubscribe from the htdig mailing list, send a message to htdig-unsubscribe@htdig.org You will receive a message to confirm this.




This archive was generated by hypermail 2b28 : Wed Mar 01 2000 - 09:36:28 PST