Re: htdig: Indexing PDFs using 'xpdf' instead of Acroread


Geoff Hutchison (ghutchis@wso.williams.edu)
Mon, 11 Jan 1999 13:15:51 -0400


At 11:35 AM -0400 1/11/99, Rick Wiggins wrote:

>Perhaps future versions of 'htdig' can generalize the 'pdf_parser'
>attribute such that this modification would not be necessary when using
>programs other than Acroread. Just a thought...

Future versions will do so (see the TODO.html file). However, see below.

>comes with a 'pdftops' utility program. To use this program, I had to
>modify 'htdig' so that it wouldn't include the '-toPostScript' command
>option and would completely specify the output filename, like this:

Mm. Last time this came up, when the PDF parser was first included, I was
given a pretty definitive answer from Michael J. Long <mjlong@summa4.com>:

>I have looked at the output from acroread and from xpdf's version of
>pdftops and they differ slightly. Sylvain's PDF module uses
>acroread-specific tags (BT and ET) to determine where to start searching for
>words to index. Unfortunately, pdftops does not insert these tags into
>the PostScript output.
>
>Therefore, the PDF module will not work with pdftops as is. I have some
>theories on how to tweak the PDF module to work with both:
> - convert the pdf to ps and use the Postscript module to parse it
>   (looking at the way the modules work, I don't know if this is
>   possible; I haven't looked at it that much, though)
> - convert the pdf to text and parse the text
> - improve the parsing capability by stealing code from the
>   Postscript module
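
For what it's worth, the second idea in that list (convert the pdf to text
and parse the text) could look roughly like the sketch below. This is not
htdig code. It assumes xpdf's 'pdftotext' is on the PATH and that passing
'-' as the output file sends the text to stdout; the function names, the
UTF-8 decoding, and the minimum word length of 3 are made up for the
illustration.

```python
import re
import subprocess

# Hypothetical word pattern: runs of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def pdf_to_text(pdf_path):
    """Return the text of a PDF by running xpdf's pdftotext.

    Assumption: pdftotext is installed, and '-' as the output
    file name makes it write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    # Assumption: output is treated as UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def extract_words(text, min_length=3):
    """Return lowercased words of at least min_length characters, in order."""
    return [w.lower() for w in WORD_RE.findall(text) if len(w) >= min_length]
```

Something like extract_words(pdf_to_text("manual.pdf")) would then give a
word list to hand to an indexer, without depending on the BT and ET tags
at all.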

Now if the situation has changed, let me know. In the meantime, I'm not
going to suggest using xpdf. I'd rather not suggest acroread since it's not
open source. But...

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
