Re: xpdf 0.90 announcement (was Re: [htdig] parse_doc.pl slow)


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 12 Aug 1999 14:22:10 -0500 (CDT)


According to Frank Guangxin Liu:
> Here is how I tested it:
> pdftotext.old -rawdump test.pdf
> grep F_Table test.txt
> can't find any match. (F_Table is a word in the landscape table
> on Page 54 of 72).
>
> pdftotext.new -raw test.pdf
> grep F_Table test.txt
> found the match!!
>
> I understand the "test.txt" generated from the new pdftotext
> still looks funny (unformatted) for those landscape tables
> (Page 48 and beyond), but at least it has all the words in
> there, which is all htdig cares about.

But not all the words are intact. Here's an example of pdftotext output
from the PDF you gave me:

  Co
mpliance wit
h QS
P 1-
02, Pro
tection of Pro
prietary Interests,
 is re
quired. Info
rmation contained with
in this d
ocument or generated as a result thereof is no
t to be disclosed to third partie
s

Most of the words are intact, but a lot of them wrap onto another line,
so htdig treats the two parts as separate words. Yes, it's a lot better
than what you'd get with pdftotext 0.80, with my rawdump patch, but is it
as good as what you'd get from htdig's parsing of acroread's PostScript
output?
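
To make the problem concrete, here is a minimal Python sketch of what
happens when a word-based indexer sees that excerpt. It is not htdig's
actual parser: index_words() is a hypothetical stand-in that simply
assumes words are runs of letters, digits and underscores, split at
anything else (including line breaks).

```python
import re

# Excerpt of the pdftotext output quoted above, line breaks preserved.
SAMPLE = """  Co
mpliance wit
h QS
P 1-
02, Pro
tection of Pro
prietary Interests,
 is re
quired. Info
rmation contained with
in this d
ocument or generated as a result thereof is no
t to be disclosed to third partie
s"""


def index_words(text):
    """Hypothetical stand-in for an indexer's word splitter (not htdig code).

    Assumption: a word is a run of letters, digits or underscores, and
    everything else, including a newline, ends the word.
    """
    return {w.lower() for w in re.findall(r"[A-Za-z0-9_]+", text)}


words = index_words(SAMPLE)

# Words that wrapped across a line break are missing as whole words;
# only their fragments were indexed.
for query in ("compliance", "document", "disclosed"):
    print(query, "found" if query in words else "missing")
# prints:
#   compliance missing
#   document missing
#   disclosed found
```

So a search for "compliance" or "document" would miss this page, even
though a grep for a word that happened not to wrap would succeed.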

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



