Rick Wiggins (email@example.com)
Tue, 23 Feb 1999 18:36:03 -0500
>Yes, Rick, this was extremely helpful! You're going to find this
>surprising, but the only reason htdig managed to index anything in your
>test.pdf file is a pure fluke. I ran pdftops on your test.pdf, and got
>a copy identical to the test.ps that I got from your web site.
>I then searched for lines that begin with BT in this PS file, and got
>It's part of a compressed image at the bottom of page 51. That means
>that anything before page 52 did not get indexed. I tried searching
>for "inspection", which appears on page 51, and it didn't find it in
Well, crap! Thanks for pointing this out!
>I think htdig/PDF.cc should be modified to recognise the pdfStartPage and
>pdfEndPage operators that pdftops outputs, and use these as equivalent to
>BT & ET. Either that or key on %%EndPageSetup (to start scanning text)
>and %%PageTrailer (to stop scanning text) -- that may also work with other
>PDF to PS converters. I just hope that doesn't pose other problems.
>For example, if a BT can randomly appear in a compressed image, what
>about other operators? Is PDF.cc going to start seeing, and acting on,
>other phantom operators in the PostScript code if it scans the whole page?
>I assume that's why Sylvain Wallez split up the parsing of text lines
>and non-text lines in the first place, when he wrote this code.
>The other option would be to change the xpdf code to output the BT & ET
>tags. It's a cleaner solution, but it moves it more out of our control.
>Anyone care to comment?
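
To make the second suggestion above concrete, here is a minimal sketch of page-delimited scanning: only the lines between a start marker and an end marker are passed on for text extraction. It is written in Python for brevity and is not the htdig PDF.cc code. The function name page_body_lines and the sample lines are hypothetical, and where exactly pdftops places these markers relative to the page content is an assumption that would need checking against real output.

```python
# Sketch only: page-delimited scanning, not the actual htdig PDF.cc logic.
# The default markers are the DSC comments named in the message above;
# "pdfStartPage" / "pdfEndPage" could be passed in instead.


def page_body_lines(ps_lines, start="%%EndPageSetup", end="%%PageTrailer"):
    """Yield only the lines that fall between a start and an end marker."""
    in_page = False
    for line in ps_lines:
        stripped = line.strip()
        if not in_page:
            # Ignore everything until the page body begins.
            if stripped.startswith(start):
                in_page = True
            continue
        if stripped.startswith(end):
            # Page body is over; wait for the next start marker.
            in_page = False
            continue
        yield line


# Hypothetical input, made up for illustration (not real pdftops output).
sample = [
    "BT  stray match outside any page",
    "%%Page: 1 1",
    "%%EndPageSetup",
    "(inspection) Tj",
    "%%PageTrailer",
    "(outside) Tj",
]
print(list(page_body_lines(sample)))
```

This only narrows where the parser looks. It does not address the concern about phantom operators inside a page body, since compressed image data within the page would still be scanned.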
Why can't we have it index a text version of the PDF? xpdf has a pdftotext
utility that could be used. What is the advantage of converting to
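
For reference, a rough sketch of what calling pdftotext from a script might look like. The helper names pdftotext_command and extract_text are hypothetical, and the sketch assumes the installed pdftotext accepts "-" as the output file to mean standard output, which may not hold for every xpdf release. It shows the plain-text route only and says nothing about how htdig would hook it in.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path="-"):
    """Build the argument list; "-" is assumed to mean 'write to stdout'."""
    return ["pdftotext", pdf_path, txt_path]


def extract_text(pdf_path):
    """Return the PDF's text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    # Decode leniently; the output encoding depends on the pdftotext build.
    return result.stdout.decode("utf-8", errors="replace")
```

With this route the indexer would see plain text only, so there would be no BT or ET operators to look for at all.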
This archive was generated by hypermail 2.0b3 on Fri Feb 26 1999 - 14:34:12 PST