Re: htdig: ht3.1.0b1 and PDF


Michael J. Long (mjlong@summa4.com)
Fri, 11 Sep 1998 17:38:04 -0400


Geoff Hutchison wrote:

[...snip...]

> there was a lot of discussion about
> using other programs to parse PDF files. I don't think anyone has tested
> using other programs,

I have looked at the output from acroread and from xpdf's version of
pdftops, and they differ slightly. Sylvain's PDF module uses
acroread-specific tags (BT and ET) to determine where to start searching
for words to index. Unfortunately, pdftops does not insert these tags
into its PostScript output.

Therefore, the PDF module will not work with pdftops as is. I have some
theories on how to tweak the PDF module to work with both:
        - convert the PDF to PostScript and use the Postscript module
          to parse it (looking at the way the modules work, I don't
          know if this is possible, but I haven't looked at it that
          much)
        - convert the PDF to text and parse the text
        - improve the parsing capability by stealing code from
          the Postscript module
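
To make the "work with both" idea concrete, here is a minimal sketch in
Python (not ht://Dig code). It assumes that the words to index sit in
PostScript string literals, i.e. inside parentheses, and that acroread
output wraps text regions in standalone BT ... ET tokens while pdftops
output has no such tokens. The names extract_words and candidate_regions
are hypothetical, and real converter output may well need more careful
handling (nested parentheses, octal escapes, font encodings).

```python
import re

# Assumption: acroread-style output delimits text regions with BT ... ET.
BT_ET_BLOCK = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# A PostScript string literal: ( ... ) with backslash escapes.
# Simplification: unescaped nested parentheses are not handled.
PS_STRING = re.compile(r"\((?:\\.|[^\\()])*\)")
WORD = re.compile(r"[A-Za-z0-9]+")


def candidate_regions(ps_text):
    """Return the BT/ET regions if any exist, else the whole text."""
    blocks = BT_ET_BLOCK.findall(ps_text)
    return blocks if blocks else [ps_text]


def extract_words(ps_text):
    """Collect words from string literals in the candidate regions."""
    words = []
    for region in candidate_regions(ps_text):
        for literal in PS_STRING.findall(region):
            body = literal[1:-1]
            # Undo only the escapes for parentheses and backslash.
            body = re.sub(r"\\([()\\])", r"\1", body)
            words.extend(WORD.findall(body))
    return words
```

With tagged input, only the strings between BT and ET are scanned; with
untagged input, every string literal in the file is scanned, which would
probably pick up some non-text strings and need filtering.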

Does anyone out there have any nuggets of wisdom to impart?

> but I figured it would be better to name it
> "pdf_parser" than "acroread" anyway.

Good choice.

[...snip...]

Michael J. Long

-- 
* Michael J. Long * #include "std/disclaimer.h"
*   Summa Four    * Work: mjlong@Summa4.COM
* Manchester, NH  * Play: mjlong@mindspring.com


