htdig: More on PDF parsing


Sylvain Wallez (s.wallez.alcatel@e-mail.com)
Thu, 20 Aug 1998 11:06:47 +0200


Hello htdiggers,

I was on vacation for the last 3 weeks, and I would like to respond to
the discussions that took place during that time about the PDF parser I
wrote.

(1) modification and publication :
I wrote the parser for my own needs, and I'm pleased to give it to the
htdig community (for the lawyers, consider it covered by the GNU GPL).
In other words, take it, use it, patch it and, if it's worth it,
publish it !
Geoff : maybe it can be included in htdig 3.1 ?

(2) acroread vs xpdf, ghostscript, etc :
The reason for using a PDF converter (to PostScript or text) is to
handle, without extra effort, the various compression schemes the PDF
format supports.
My PDF parser relies heavily on the way acroread translates PDF files to
PostScript (text blocks are easy to find). I don't use xpdf or
ghostscript, so I haven't checked their output format, and I don't know
if it's the same as acroread's.
Michael : your idea of adding a config attribute to allow any PDF to
PS converter should be checked against that, since the parser depends
on the layout of acroread's output.

(3) acroread "cannot repair file" and "expected a dict object" errors :
These errors occur when the size of a document exceeds the max_doc_size
attribute of the htdig configuration. In that case, htdig truncates the
file, which is acceptable for HTML or text files but not for PDF files :
acroread cannot parse incomplete files.
If this happens, increase the value of max_doc_size in the config file
so that it is at least as large as your biggest PDF document.
Hope this will help Malka.
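
To illustrate why truncation breaks PDF files, here is a small Python
sketch. It is not part of the parser patch, and the function names
(simulate_fetch, looks_truncated) are made up for this example. It
relies only on the fact that a complete PDF file ends with a "%%EOF"
marker: once a document is cut at max_doc_size bytes, that marker (and
the trailer that precedes it) is gone, which is the kind of damage
acroread complains about.

```python
def simulate_fetch(data: bytes, max_doc_size: int) -> bytes:
    """Mimic the truncation described above: keep only the first
    max_doc_size bytes of the document (hypothetical helper)."""
    return data[:max_doc_size]


def looks_truncated(data: bytes, tail: int = 1024) -> bool:
    """Return True if the PDF end-of-file marker is missing from the
    last `tail` bytes, a rough hint that the file was cut short."""
    return b"%%EOF" not in data[-tail:]


# A tiny stand-in for a PDF: only the header and end marker are real,
# the body is filler, so this is NOT a valid document.
sample = b"%PDF-1.2\n" + b"x" * 500 + b"\n%%EOF\n"

cut = simulate_fetch(sample, 100)            # limit smaller than the file
whole = simulate_fetch(sample, len(sample))  # limit large enough

print(looks_truncated(cut))
print(looks_truncated(whole))
```

A check like this can tell you whether a given file would survive the
current limit; the actual fix remains raising max_doc_size in the
config file.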

(4) support
As I wrote this parser, I'm the one who knows it best :-) But before
asking me for help, please consider the following :
- I use only HP workstations, so I'm not a cross-platform compilation
guru.
- I am now working on a project with a tight schedule and have little
time, so *please* consult the readme file in the patch and the mailing
list archive (if it comes back to life) before requesting help.

Regards.

-- 
----------------------------------------------------------------------
Sylvain Wallez                  Software engineer / Intranet Webmaster
Alcatel Space Industries
Toulouse, France                    mailto:s.wallez.alcatel@e-mail.com
----------------------------------------------------------------------


