Re: [htdig] pdf parser: No error;) Search: No results;(


Joe R. Jah (jjah@cloud.ccsf.cc.ca.us)
Mon, 22 Feb 1999 14:23:01 -0800 (PST)


On Mon, 22 Feb 1999, Gilles Detillieux wrote:

> Date: Mon, 22 Feb 1999 13:23:16 -0600 (CST)
> From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
> To: "Joe R. Jah" <jjah@cloud.ccsf.cc.ca.us>
> Cc: htdig@htdig.org
> Subject: Re: [htdig] pdf parser: No error;) Search: No results;(
>
> According to Joe R. Jah:
> > I run ht://Dig 3.1.1 including the parser patch on a BSDI 4.0 box. In my
> > htdig.config I have:
> >
> > pdf_parser: /usr/contrib/bin/pdftops
> >
> > rundig does not complain about any pdf files except two large files, for
> > which I plan to increase:
> >
> > max_head_length: 50000
> >
> > to some very high number; however, search does not find any words in pdf
> > files; they do not show up in any results.
> >
> > Has anyone successfully used pdftops to dig pdf files?
> >
> > I appreciate any pointers.
>
> The code in htdig/PDF.cc expects the PostScript output from the pdf
> parser to be in a very specific format -- the one that acroread outputs.
> The latest version of xpdf is supposed to output PostScript in a
> compatible format, from what I've read on this list, but I haven't seen
> any mention of pdftops. My guess, given the lack of results you reported,
> is that its PostScript output is not compatible. If it doesn't find
> the tags it expects in the PostScript, it won't give any error messages.
> It'll just silently ignore what's there as it scans for the
> beginning-of-text-block marker.

pdftops is part of the xpdf package; I just downloaded, compiled, and
installed the latest version of xpdf, then ran rundig again. Still no
search results;*(

> As for dealing with large files, it's max_doc_size you need to adjust.
> By default, it's 100000, so you need to increase it if dealing with files
> larger than 100K. The max_head_length attribute determines how much of
> the document text will be stored for excerpts, but this is done on the
> processed text.

Thanks; I stand corrected. I just added max_doc_size to my config file and
set it to 650000 to cover all the existing pdf files in my search path.
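
For anyone picking a value the same way, here is a minimal sketch of how a
figure like 650000 could be arrived at: walk the document tree on disk,
find the largest pdf file, and set max_doc_size somewhat above it. The
function name largest_pdf_size and the choice of root directory are
assumptions for illustration only, and it presumes the files htdig fetches
over HTTP are the same size as the copies on disk.

```python
import os
import sys


def largest_pdf_size(root):
    """Return (size_in_bytes, path) for the largest .pdf under root.

    Returns (0, None) when no pdf files are found.
    Hypothetical helper; not part of ht://Dig.
    """
    largest = (0, None)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match .pdf regardless of case (.pdf, .PDF, ...)
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > largest[0]:
                    largest = (size, path)
    return largest


if __name__ == "__main__":
    # Root defaults to the current directory if none is given.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    size, path = largest_pdf_size(root)
    print(f"largest pdf: {path} ({size} bytes)")
```

Run against the web server's document root, this prints the size that
max_doc_size has to exceed for every pdf file to be fetched whole.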

As a side note, I think it would be very helpful if the sample config
file listed every option with its default value, perhaps with a short
comment.

Joe

     _/ _/_/_/ _/ ____________ __o
     _/ _/ _/ _/ ______________ _-\<,_
 _/ _/ _/_/_/ _/ _/ ......(_)/ (_)
  _/_/ oe _/ _/. _/_/ ah jjah@cloud.ccsf.cc.ca.us
