Re: [htdig] pdf parser: No error;) Search: No results;(


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 23 Feb 1999 16:33:38 -0600 (CST)


According to Rick Wiggins:
> Well, the big project that we were going to do that required indexing PDFs
> hasn't progressed from the point it was at when I promised to provide an
> update. So, no update yet. My information is based on using htdig 3.1.0b4
> (with minor modifications) and pdftops (from xpdf 0.80) to index ONE PDF
> document. This worked fine for me!
>
> The change I made to htdig was in PDF.cc:
>
> // acroread << " -toPostScript " << pdfName << " " << tmpdir << " 2>&1";
> acroread << " " << pdfName << " " << psName << " 2>&1";
>
> This appears to make pdftops happy.
>
> I've put my test document back on my web site and re-indexed the site so
> that you can see that it works yourself. If you go to
> http://www.gwis.com/search and search for 'isp' you will see
> 'http://www.gwis.com/help/test.pdf' listed as the second hit. I've placed
> a PostScript copy at http://www.gwis.com/help/test.ps.
>
> I hope this helps. Please let me know what else you discover or if you
> would like any additional information...

Yes, Rick, this was extremely helpful! You're going to find this
surprising, but the only reason htdig managed to index anything in your
test.pdf file is a pure fluke. I ran pdftops on your test.pdf, and got
a copy of test.ps identical to the one I got from your web site.
I then searched for lines that begin with BT in this PS file, and got
just this one line:

BT(hG08Gqd'][NI$^4&%j^*m-i`K3"5pqCqFYc.k=\#G28;LQ[-W`*&:ES#dVmu2T

It's part of a compressed image at the bottom of page 51. Since PDF.cc
only looks for text after it sees a BT, that means that anything before
page 52 did not get indexed. I tried searching for "inspection", which
appears on page 51, and the search didn't find it in your index.
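
To illustrate how a line like that can fool a parser, here is a minimal
C++ sketch. It assumes the text scanner is triggered by any line that
begins with BT; it is not the actual PDF.cc code, and both function names
are hypothetical. It contrasts that prefix test with a stricter one that
requires BT to stand alone as a token.

```cpp
#include <string>

// Naive test: any line whose first two characters are "BT" counts as the
// start of a text block. (Assumed behaviour, for illustration only.)
bool startsWithBT(const std::string &line)
{
    return line.compare(0, 2, "BT") == 0;
}

// Stricter, hypothetical test: "BT" must be a whole token, i.e. it is
// followed by whitespace or by the end of the line.
bool isStandaloneBT(const std::string &line)
{
    if (line.compare(0, 2, "BT") != 0)
        return false;
    return line.size() == 2 || line[2] == ' ' || line[2] == '\t' ||
           line[2] == '\r';
}
```

With the image line quoted above, startsWithBT() returns true while
isStandaloneBT() returns false. A stricter test lowers the odds of a
phantom match but does not rule one out, since a line of image data could
still happen to match.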

I think htdig/PDF.cc should be modified to recognise the pdfStartPage and
pdfEndPage operators that pdftops outputs, and use these as equivalent to
BT & ET. Either that or key on %%EndPageSetup (to start scanning text)
and %%PageTrailer (to stop scanning text) -- that may also work with other
PDF to PS converters. I just hope that doesn't pose other problems.
For example, if a BT can randomly appear in a compressed image, what
about other operators? Is PDF.cc going to start seeing, and acting on,
other phantom operators in the PostScript code if it scans the whole page?
I assume that's why Sylvain Wallez split up the parsing of text lines
and non-text lines in the first place, when he wrote this code.
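
The marker-based idea above could be sketched roughly as follows. This is
only a sketch, not a patch to PDF.cc: the function name linesInsidePages
is hypothetical, and matching each marker as a line prefix is an
assumption about how the converter lays out its output.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns the lines that fall between a start marker and an end marker.
// The markers are parameters, so the same loop can be tried with
// "pdfStartPage"/"pdfEndPage" or "%%EndPageSetup"/"%%PageTrailer".
// Assumption: each marker appears at the beginning of a line.
std::vector<std::string> linesInsidePages(const std::string &ps,
                                          const std::string &startMark,
                                          const std::string &endMark)
{
    std::vector<std::string> kept;
    std::istringstream in(ps);
    std::string line;
    bool inPage = false;    // true while we are between the two markers

    while (std::getline(in, line)) {
        if (!inPage) {
            // Outside a page: ignore everything until the start marker.
            if (line.compare(0, startMark.size(), startMark) == 0)
                inPage = true;
        } else if (line.compare(0, endMark.size(), endMark) == 0) {
            // End marker: stop collecting until the next page starts.
            inPage = false;
        } else {
            kept.push_back(line);
        }
    }
    return kept;
}
```

Note that this only selects the region to scan. Every line inside a page
is kept, including any compressed image data, so the worry about phantom
operators still applies to whatever parses the returned lines.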

The other option would be to change the xpdf code to output the BT & ET
tags. It's a cleaner solution, but it moves the fix further out of our
control.

Anyone care to comment?

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930