htdig: More on PDF parsing

Sylvain Wallez
Thu, 20 Aug 1998 11:06:47 +0200

Hello htdiggers,

I was on vacation for the last 3 weeks, and I would like to respond to
the discussions that took place during that period about my PDF parser.

(1) modification and publication :
I wrote the parser for my own needs, and I'm pleased to give it to the
htdig community (for the lawyers, consider the gnu GPL). In other words,
take it, use it, patch it and if it's worth it, publish it !
Geoff : maybe it can be included in htdig 3.1 ?

(2) acroread vs xpdf, ghostscript, etc :
The reason for using a pdf converter (to postscript or text) is to
handle without effort the various compression schemes of the pdf format.
My pdf parser relies heavily on the way acroread translates pdf files to
postscript (text blocks are easy to find). I don't use xpdf or
ghostscript, so I didn't check their output format, and I don't know if
it's the same as acroread.
Michael : your idea about adding a config attribute to allow any pdf to
ps converter should be checked against that.

(3) acroread "cannot repair file" and "expected a dict object" errors :
These errors occur when the size of a document exceeds the max_doc_size
attribute of the htdig configuration. In that case, htdig truncates the
file, which is suitable for html or text files but not for pdf files :
acroread cannot parse incomplete files.
If this happens, increase the value of max_doc_size in the config file.
Hope this will help Malka.
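
As a rough illustration of why truncation hurts pdf files in particular:
a pdf file keeps its cross-reference table and trailer at the end, closed
by a "%%EOF" marker, so cutting the file at max_doc_size bytes removes
exactly the part a reader needs first. The Python sketch below is not part
of htdig or of the parser; the function names and the sample data are
hypothetical, and the check (looking for "%%EOF" near the end) is only a
heuristic.

```python
def truncate(data: bytes, max_doc_size: int) -> bytes:
    """Mimic a retrieval step that keeps only the first max_doc_size bytes."""
    return data[:max_doc_size]


def looks_truncated(data: bytes, tail: int = 1024) -> bool:
    """Heuristic: a complete pdf normally has '%%EOF' near its end."""
    return b"%%EOF" not in data[-tail:]


# Tiny stand-in for a pdf file (hypothetical content, not a valid document).
sample = b"%PDF-1.2\n" + b"x" * 200 + b"\nstartxref\n209\n%%EOF\n"

if __name__ == "__main__":
    for limit in (100, len(sample)):
        cut = truncate(sample, limit)
        state = "truncated" if looks_truncated(cut) else "complete"
        print(f"limit={limit}: {state}")
```

A check of this kind could be used to spot which documents need a larger
max_doc_size before they are handed to acroread.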

(4) support
As I wrote this parser, I'm the one who knows it best :-) But before
asking me for help, please consider the following :
- I use only HP workstations, so I'm not a cross-platform compilation
expert.
- I am now working on a project with a tight schedule and have little time,
so *please* consult the readme file in the patch and the mailing list
archive (if it comes back to life) before asking for help.


Sylvain Wallez                  Software engineer / Intranet Webmaster
Alcatel Space Industries
Toulouse, France          
