Re: xpdf 0.90 announcement (was Re: [htdig] parse_doc.pl slow)


Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Thu, 12 Aug 1999 15:30:04 -0500 (EST)


On Thu, 12 Aug 1999, Gilles Detillieux wrote:

>
> According to Frank Guangxin Liu:
> > Here is how I tested it:
> > pdftotext.old -rawdump test.pdf
> > grep F_Table test.txt
> > can't find any match. (F_Table is a word in the landscape table
> > on Page 54 of 72).
> >
> > pdftotext.new -raw test.pdf
> > grep F_Table test.txt
> > found the match!!
> >
> > I understand the "test.txt" generated from the new pdftotext
> > still looks funny (unformatted) for those landscape tables
> > (Page 48 and beyond), but at least it has all the words in
> > there, which is all htdig cares about.
>
> But not all the words are intact. Here's an example of pdftotext output
> from the PDF you gave me:
>
> Co
> mpliance wit
> h QS
> P 1-
> 02, Pro
> tection of Pro
> prietary Interests,
> is re
> quired. Info
> rmation contained with
> in this d
> ocument or generated as a result thereof is no
> t to be disclosed to third partie
> s
>
> Most of the words are intact, but a lot of them wrap onto another line,
> so htdig treats the two parts as separate words. Yes, it's a lot better
> than what you'd get with pdftotext 0.80, with my rawdump patch, but is it
> as good as what you'd get from htdig's parsing of acroread's PostScript
> output?

You are right. Though the new pdftotext is better, it isn't as
good as acroread yet.
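
To make the problem concrete, here is a minimal Python sketch of why the
wrapped lines hurt indexing. It is only an illustration: index_words() is a
hypothetical stand-in that splits on non-alphanumeric characters, not
htdig's actual word parser, and gluing lines back together is shown purely
for comparison, not as a fix that would be safe on normal text.

```python
import re

# The pdftotext 0.90 output lines quoted above, one string per output line.
lines = [
    "Co", "mpliance wit", "h QS", "P 1-", "02, Pro", "tection of Pro",
    "prietary Interests,", "is re", "quired. Info", "rmation contained with",
    "in this d", "ocument or generated as a result thereof is no",
    "t to be disclosed to third partie", "s",
]

def index_words(text):
    """Hypothetical tokenizer: lowercase runs of letters and digits.

    This is an assumption for illustration, not how htdig really splits words.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# What an indexer sees when every line break acts as a word separator.
raw_words = index_words("\n".join(lines))

# For comparison only: the same lines glued together with no separator.
joined_words = index_words("".join(lines))

for word in ("compliance", "information", "document"):
    print(word, word in raw_words, word in joined_words)
```

With the line breaks kept, only fragments such as "co" and "mpliance" end up
in the word list, so a search for the whole word would miss this page.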
Thanks!
Frank

>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig@htdig.org containing the single word unsubscribe in
> the SUBJECT of the message.
>




This archive was generated by hypermail 2.0b3 on Thu Aug 12 1999 - 13:31:24 PDT