Re: htdig: Indexing PDFs using 'xpdf' instead of Acroread


Rick Wiggins (wiggins@gwis.com)
Mon, 11 Jan 1999 15:31:56 -0500


At 1:15 PM -0400 1/11/99, Geoff Hutchison wrote:
>At 11:35 AM -0400 1/11/99, Rick Wiggins wrote:
>
>>Perhaps future versions of 'htdig' can generalize the 'pdf_parser'
>>attribute such that this modification would not be necessary when using
>>programs other than Acroread. Just a thought...
>
>Future versions will do so (see the TODO.html file). However, see below.
>
>>comes with a 'pdftops' utility program. To use this program, I had to
>>modify 'htdig' so that it wouldn't include the '-toPostScript' command
>>option and would completely specify the output filename, like this:
>
>Mm. Last time this came up, when the PDF parser was first included, I was
>given a pretty definitive answer from Michael J. Long <mjlong@summa4.com>:
>
>>I have looked at the output from acroread and from xpdf's version of
>>pdftops and they differ slightly. Sylvain's PDF module uses acroread
>>specific tags (BT and ET) to determine where to start searching for
>>words to index. Unfortunately, pdftops does not insert these tags into
>>the PostScript output.
>>
>>Therefore, the PDF module will not work with pdftops as is. I have some
>>theories on how to tweak the PDF module to work with both:
>>  - convert the pdf to ps and use the Postscript module to
>>    parse it (looking at the way the modules work, I don't
>>    know if this is possible, I haven't looked at it that much
>>    though)
>>  - convert the pdf to text and parse the text
>>  - improve the parsing capability by stealing code from
>>    the Postscript module
>
>Now if the situation has changed, let me know. In the meantime, I'm not
>going to suggest using xpdf. I'd rather not suggest acroread since it's not
>open source. But...
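
To make the BT/ET point concrete, here is a rough Python sketch of that
style of scan: it only collects words from string literals that sit
between a BT and an ET marker. This is not the htdig PDF module's code.
The function name and the sample input are made up for illustration, and
it deliberately ignores nested parentheses and backslash escapes.

```python
import re
from typing import List

# Hypothetical illustration only, not htdig's actual parser.
# A text block is whatever lies between a BT marker and the next ET marker.
TEXT_BLOCK = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)

# Simplification: only plain "(...)" strings without nested parentheses
# or backslash escapes are recognized.
STRING_LITERAL = re.compile(r"\(([^()\\]*)\)")


def words_between_bt_et(stream: str) -> List[str]:
    """Return the words found in string literals inside BT ... ET blocks."""
    words: List[str] = []
    for block in TEXT_BLOCK.finditer(stream):
        for literal in STRING_LITERAL.finditer(block.group(1)):
            words.extend(literal.group(1).split())
    return words


if __name__ == "__main__":
    sample = "BT /F1 12 Tf (Hello world) Tj ET\n(ignored) Tj\nBT (again) Tj ET"
    print(words_between_bt_et(sample))
```

Given converter output that contains no BT/ET pairs at all, a scan like
this finds nothing to index, which would match the failure described
above.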

Interesting. 'pdftops' seems to be working fine for me. :-/ I'm using
version 0.80 of 'xpdf' which came out on Nov. 27, 1998. Perhaps this
problem has been corrected in this version? We'll be indexing a large
number of PDFs in the near future. I'll report back how it goes using
'pdftops'...

Rick
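
For anyone who wants to try the same conversion outside of 'htdig', the
invocation amounts to giving 'pdftops' the input file and a fully
specified output file. Below is a small Python sketch under the
assumption that 'pdftops' is on the PATH. The helper names are
hypothetical, and this is not how 'htdig' itself launches the converter.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def pdftops_command(pdf_path: str, ps_path: Optional[str] = None) -> List[str]:
    """Build the argument list: pdftops <input.pdf> <output.ps>."""
    if ps_path is None:
        # Assumption: default to the same name with a .ps extension.
        ps_path = str(Path(pdf_path).with_suffix(".ps"))
    return ["pdftops", pdf_path, ps_path]


def convert(pdf_path: str, ps_path: Optional[str] = None) -> str:
    """Run pdftops and return the path of the PostScript file it wrote."""
    cmd = pdftops_command(pdf_path, ps_path)
    # check=True raises CalledProcessError if pdftops exits with an error.
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Calling convert("report.pdf") would run 'pdftops report.pdf report.ps'.
Whether the resulting PostScript is usable by a given indexer still has
to be checked against real documents.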


This archive was generated by hypermail 2.0b3 on Wed Jan 13 1999 - 09:13:04 PST