Re: [htdig] pdf parser: No error;) Search: No results;(


Rick Wiggins (wiggins@gwis.com)
Tue, 23 Feb 1999 18:36:03 -0500


>Yes, Rick, this was extremely helpful! You're going to find this
>surprising, but the only reason htdig managed to index anything in your
>test.pdf file is a pure fluke. I ran pdftops on your test.pdf, and got
>a test.ps identical to the one I got from your web site. I then searched
>for lines that begin with BT in this PS file, and got only this line:
>
>BT(hG08Gqd'][NI$^4&%j^*m-i`K3"5pqCqFYc.k=\#G28;LQ[-W`*&:ES#dVmu2T
>
>It's part of a compressed image at the bottom of page 51. That means
>that anything before page 52 did not get indexed. I tried searching
>for "inspection", which appears on page 51, and the search didn't find
>it in your index.
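
To make the fluke above concrete, here is a minimal sketch of a scan for
lines that begin with BT. It is not the logic in htdig/PDF.cc; the function
name and the surrounding sample lines are made up for illustration, and only
the BT line itself is taken from the message.

```python
# Hypothetical sketch, not the actual htdig/PDF.cc code.
def naive_bt_lines(ps_lines):
    """Return (index, line) for every line that starts with 'BT'."""
    return [(i, line) for i, line in enumerate(ps_lines)
            if line.startswith("BT")]

# Only the BT line is real (quoted from the message); the other lines
# are invented stand-ins for converter output.
sample = [
    "(made-up text line on page 51)",
    r'''BT(hG08Gqd'][NI$^4&%j^*m-i`K3"5pqCqFYc.k=\#G28;LQ[-W`*&:ES#dVmu2T''',
    "(made-up text line on page 52)",
]

hits = naive_bt_lines(sample)
print(hits[0][0])  # index of the image-data line that looks like a BT operator
```

A prefix test like this cannot tell an operator from image data that happens
to start with the same two letters, which is the false match described above.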

Well, crap! Thanks for pointing this out!

>I think htdig/PDF.cc should be modified to recognise the pdfStartPage and
>pdfEndPage operators that pdftops outputs, and use these as equivalent to
>BT & ET. Either that or key on %%EndPageSetup (to start scanning text)
>and %%PageTrailer (to stop scanning text) -- that may also work with other
>PDF to PS converters. I just hope that doesn't pose other problems.
>For example, if a BT can randomly appear in a compressed image, what
>about other operators? Is PDF.cc going to start seeing, and acting on,
>other phantom operators in the PostScript code if it scans the whole page?
>I assume that's why Sylvain Wallez split up the parsing of text lines
>and non-text lines in the first place, when he wrote this code.
>
>The other option would be to change the xpdf code to output the BT & ET
>tags. It's a cleaner solution, but it moves it more out of our control.
>
>Anyone care to comment?
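
The %%EndPageSetup / %%PageTrailer idea above could be sketched roughly as
follows. This is only an illustration under assumptions: the function name
and the sample lines are hypothetical, and whether a given PDF to PS
converter emits both comments on every page would need checking.

```python
# Hypothetical sketch of page-delimited scanning; not htdig code.
def page_body_lines(ps_lines, start="%%EndPageSetup", stop="%%PageTrailer"):
    """Yield the lines strictly between each start/stop comment pair."""
    inside = False
    for line in ps_lines:
        stripped = line.rstrip("\r\n")
        if stripped.startswith(start):
            inside = True       # page body begins after this comment
        elif stripped.startswith(stop):
            inside = False      # page body ended
        elif inside:
            yield stripped

# Invented sample; real converter output will look different.
sample = [
    "%%Page: 1 1",
    "%%BeginPageSetup",
    "%%EndPageSetup",
    "(hypothetical page text)",
    "%%PageTrailer",
    "%%Trailer",
]

body = list(page_body_lines(sample))
print(body)
```

Note that this only narrows the scan to page bodies. Compressed image data
inside a page would still pass through, so the phantom operator worry raised
above is not addressed by the delimiters alone.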

Why can't we have it index a text version of the PDF? xpdf has a pdftotext
utility that could be used. What is the advantage of converting to
PostScript?

Rick
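
For anyone who wants to try the pdftotext route asked about above, a minimal
sketch follows. It assumes pdftotext is installed and on the PATH, and that
passing "-" as the output file sends the text to standard output. The
function names are hypothetical, the output encoding depends on the
pdftotext build (UTF-8 is assumed here), and nothing in it is part of htdig.

```python
import subprocess

# Hypothetical helpers; not part of htdig.
def pdftotext_command(pdf_path, out_path="-"):
    """Build the argument list; "-" asks pdftotext to write to stdout."""
    return ["pdftotext", pdf_path, out_path]

def extract_text(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(pdftotext_command(pdf_path),
                            capture_output=True, check=True)
    # Assumption: UTF-8 output; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("test.pdf") would return the plain text of the file,
with no PostScript operators to pick through. This does not answer the
question of what the PostScript route offers that plain text does not.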




This archive was generated by hypermail 2.0b3 on Fri Feb 26 1999 - 14:34:12 PST