Joe R. Jah (email@example.com)
Mon, 22 Feb 1999 14:23:01 -0800 (PST)
On Mon, 22 Feb 1999, Gilles Detillieux wrote:
> Date: Mon, 22 Feb 1999 13:23:16 -0600 (CST)
> From: Gilles Detillieux <firstname.lastname@example.org>
> To: "Joe R. Jah" <email@example.com>
> Cc: firstname.lastname@example.org
> Subject: Re: [htdig] pdf parser: No error;) Search: No results;(
> According to Joe R. Jah:
> > I run ht://Dig 3.1.1 including the parser patch on a BSDI 4.0 box. In my
> > htdig.config I have:
> > pdf_parser: /usr/contrib/bin/pdftops
> > rundig does not complain about any pdf files except two large files, for
> > which I plan to increase:
> > max_head_length: 50000
> > to some very high number; however, search does not find any words in pdf
> > files; they do not show up in any results.
> > Has anyone successfully used pdftops to dig pdf files?
> > I appreciate any pointers.
> The code in htdig/PDF.cc expects the PostScript output from the pdf
> parser to be in a very specific format -- the one that acroread outputs.
> The latest version of xpdf is supposed to output PostScript in a
> compatible format, from what I've read on this list, but I haven't seen
> any mention of pdftops. My guess, given the lack of results you reported,
> is that its PostScript output is not compatible. If it doesn't find
> the tags it expects in the PostScript, it won't give any error messages.
> It'll just silently ignore what's there as it scans for the beginning
> of text block marker.
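To illustrate why an incompatible converter gives neither errors nor
results, here is a minimal sketch of the kind of marker-driven scan
described above. It is not the code in htdig/PDF.cc: the function name
and both marker strings are hypothetical, made up only for this example.

```python
def extract_text_blocks(ps_lines, begin_marker, end_marker):
    """Collect the lines found between begin_marker and end_marker lines."""
    blocks = []
    current = None  # None means we are outside a text block
    for line in ps_lines:
        stripped = line.strip()
        if current is None:
            if stripped == begin_marker:
                current = []
            # Anything else outside a block is skipped without complaint.
        elif stripped == end_marker:
            blocks.append(current)
            current = None
        else:
            current.append(stripped)
    return blocks


# Hypothetical markers; the real ones depend on the PostScript producer.
BEGIN, END = "BEGIN_TEXT_BLOCK", "END_TEXT_BLOCK"

compatible = ["%!PS", BEGIN, "(hello world)", END]
incompatible = ["%!PS", "(hello world) show"]

print(extract_text_blocks(compatible, BEGIN, END))    # [['(hello world)']]
print(extract_text_blocks(incompatible, BEGIN, END))  # []
```

The second call returns an empty list and raises nothing, which matches
the symptom of a clean rundig run with no PDF words in the index.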
pdftops is part of the xpdf package; I just downloaded, compiled, and installed
the latest version of xpdf, and reran rundig. Still no search results;*(
> As for dealing with large files, it's max_doc_size you need to adjust.
> By default, it's 100000, so you need to increase it if dealing with files
> larger than 100K. The max_head_length attribute determines how much of
> the document text will be stored for excerpts, but this is done on the
> processed text.
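One way to pick a max_doc_size value is to measure the largest PDF under
the document root first. The sketch below assumes the files are on the
local filesystem; the function names and the 10% headroom are
illustrative choices, not part of ht://Dig.

```python
import os

DEFAULT_MAX_DOC_SIZE = 100000  # default value quoted in this thread


def largest_file_size(root, suffix=".pdf"):
    """Return the size in bytes of the largest file under root ending in suffix."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest


def suggest_max_doc_size(root, suffix=".pdf", headroom=1.1):
    """Suggest a max_doc_size that covers the largest matching file.

    Never suggests less than the default, since smaller files already fit.
    """
    needed = int(largest_file_size(root, suffix) * headroom)
    return max(DEFAULT_MAX_DOC_SIZE, needed)
```

Calling suggest_max_doc_size("/path/to/docs") returns a number that can be
placed after "max_doc_size:" in the config file; the path here is a
placeholder.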
Thanks; I stand corrected. I just added that line to my config file and
increased it to 650000 to cover all the existing pdf files in my search
As a side note, I think it would be very helpful for the sample config
file to include all of the options by default, perhaps with a short
This archive was generated by hypermail 2.0b3 on Fri Feb 26 1999 - 14:34:12 PST