Re: [htdig] pdf parser: No error;) Search: No results;(


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Mon, 22 Feb 1999 13:23:16 -0600 (CST)


According to Joe R. Jah:
> I run ht/Dig 3.1.1 including the parser patch on a BSDI 4.0 box. In my
> htdig.config I have:
>
> pdf_parser: /usr/contrib/bin/pdftops
>
> rundig does not complain about any pdf files except two large files, for
> which I plan to increase:
>
> max_head_length: 50000
>
> to some very high number; however, search does not find any words in pdf
> files; they do not show up in any results.
>
> Has anyone successfully used pdftops to dig pdf files?
>
> I appreciate any pointers.

The code in htdig/PDF.cc expects the PostScript output from the pdf
parser to be in a very specific format -- the one that acroread outputs.
The latest version of xpdf is supposed to output PostScript in a
compatible format, from what I've read on this list, but I haven't seen
any mention of pdftops. My guess, given the lack of results you reported,
is that its PostScript output is not compatible. If htdig doesn't find
the tags it expects in the PostScript, it won't give any error messages.
It'll just silently ignore what's there as it scans for the
beginning-of-text-block marker.

As for dealing with large files, it's max_doc_size you need to adjust.
By default, it's 100000, so you need to increase it if dealing with files
larger than 100K. The max_head_length attribute only determines how much
of the document text will be stored for excerpts, and it is applied to the
processed text, not to the raw file.
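
For illustration, the relevant part of htdig.conf might then look something
like the fragment below. This is only a sketch: the acroread path and the
5000000 figure are assumptions, so substitute your own install location and
a value larger than your biggest PDF file.

```
# Parser whose PostScript output htdig/PDF.cc understands (path is an assumption)
pdf_parser:       /usr/local/bin/acroread -toPostScript

# Maximum number of bytes read from each document; the default is 100000
max_doc_size:     5000000

# Amount of processed text kept for excerpts; does not limit how much of the file is read
max_head_length:  50000
```

Raising max_head_length alone won't help with the two large files, since
anything past max_doc_size is never read in the first place.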

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930