Re: [htdig] pdf parser: No error;) Search: No results;(

Gilles Detillieux
Mon, 22 Feb 1999 13:23:16 -0600 (CST)

According to Joe R. Jah:
> I run ht/Dig 3.1.1 including the parser patch on a BSDI 4.0 box. In my
> htdig.config I have:
> pdf_parser: /usr/contrib/bin/pdftops
> rundig does not complain about any pdf files except two large files, for
> which I plan to increase:
> max_head_length: 50000
> to some very high number; however, search does not find any words in pdf
> files; they do not show up in any results.
> Has anyone successfully used pdftops to dig pdf files?
> I appreciate any pointers.

The code in htdig/ expects the PostScript output from the pdf
parser to be in a very specific format -- the one that acroread outputs.
The latest version of xpdf is supposed to output PostScript in a
compatible format, from what I've read on this list, but I haven't seen
any mention of pdftops. My guess, given the lack of results you reported,
is that its PostScript output is not compatible. If htdig doesn't find
the tags it expects in the PostScript, it won't give any error messages.
It'll just silently ignore what's there as it scans for the
beginning-of-text-block marker.

As for dealing with large files, it's max_doc_size you need to adjust.
The default is 100000 bytes, so you need to increase it if any of your
files are larger than about 100K. The max_head_length attribute only
determines how much of the document text will be stored for excerpts,
and that limit is applied to the processed text, not to the raw PDF.
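
For example, something along these lines in your config file should take
care of the large files. The max_doc_size value here is only an
illustration, not a recommendation; pick something comfortably larger
than your biggest PDF. The max_head_length line is just the value you
already have.

```
# Illustrative value only: must be larger than the biggest file to index
# (the default is 100000).
max_doc_size:		5000000

# Only controls how much processed text is kept for excerpts.
max_head_length:	50000
```

You'll need to rerun rundig afterwards so the large files get fetched and
parsed again with the new limit.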

