Re: xpdf 0.90 announcement (was Re: [htdig] parse_doc.pl slow)


Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Thu, 12 Aug 1999 12:33:49 -0500 (EST)


On Thu, 12 Aug 1999, Gilles Detillieux wrote:

>
> According to Frank Guangxin Liu:
> > I just installed and tested the new xpdf 0.90.
> > The new pdftotext has an option "-raw" which, I guess, should be
> > the same as the old patched -rawdump.
> > It also has the deltax fixes included.
>
> Yes, I just installed and tested it here myself. I built it without
> t1lib, because I haven't yet figured out how to compile and install t1lib.
> I just found a t1lib source RPM, so I'll give that a try next. The -raw
> option is an improvement over my -rawdump option, in that the text is
> formatted better. (That doesn't really matter for indexing, though.)
>
> > "xpdf" seems to be able to display landscape tables without
> > a problem on an XFree86 server, but not on my MetroLink X server.
> > "pdftotext" still generates a huge text file (huge lines for
> > the landscape tables), but "pdftotext -raw" generates a
> > reasonably sized file (as we found before). The good news
> > is that the text file DOES have those keywords from the landscape
> > tables!!
>
> Hmmm. I tried the new pdftotext on the test.pdf you had given me back
> when you ran into the problem with landscape tables, and it's still not
> putting out very meaningful text. It's better than before, but it's
> still breaking up a whole lot of the words. I highly doubt the absence
> of t1lib would make a difference to pdftotext, but I could be wrong.
> I'll let you know if I spot a difference. However, you should try your
> new pdftotext on the test.pdf file you gave me, and look for what it
> puts out for Page 48 and on. You may still find that for your files,
> acroread works better.
>

Here is how I tested it:

    pdftotext.old -rawdump test.pdf
    grep F_Table test.txt

No match found. (F_Table is a word in the landscape table
on Page 54 of 72.)

    pdftotext.new -raw test.pdf
    grep F_Table test.txt

Found the match!!
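
For anyone who wants to repeat this check over several files, the two
steps above (convert, then grep) could be scripted roughly as follows.
This is only a sketch: the function names are made up for illustration,
and it assumes the converter is invoked as
"<binary> <option> <pdf-file> <text-file>", as in the commands above.

```python
import subprocess
from pathlib import Path


def pdftotext_command(binary, pdf_path, txt_path, option="-raw"):
    """Build the argument list: <binary> <option> <pdf-file> <text-file>."""
    return [binary, option, str(pdf_path), str(txt_path)]


def keyword_found(text, keyword):
    """Mimic a successful `grep keyword file`: some line contains the keyword."""
    return any(keyword in line for line in text.splitlines())


def check_pdf(binary, pdf_path, keyword, option="-raw"):
    """Convert pdf_path to text with the given binary, then look for keyword."""
    # Write the output next to the PDF, e.g. test.pdf -> test.txt.
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if the converter exits non-zero.
    subprocess.run(
        pdftotext_command(binary, pdf_path, txt_path, option), check=True
    )
    # errors="replace" avoids a decode failure on odd bytes in the output.
    return keyword_found(txt_path.read_text(errors="replace"), keyword)
```

Calling check_pdf("pdftotext.new", "test.pdf", "F_Table") would then
correspond to the second test above, and
check_pdf("pdftotext.old", "test.pdf", "F_Table", option="-rawdump")
to the first, assuming both binaries are on the PATH under those names.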

I understand the "test.txt" generated by the new pdftotext
still looks funny (unformatted) for those landscape tables
(Page 48 and beyond), but at least it has all the words in
there, which is all htdig cares about.

By the way, as I said in the previous email, the "xpdf" GUI can
display landscape tables without a problem on an XFree86 server.

Frank

> > I would highly recommend that people upgrade to xpdf-0.90.
> > It also supports PDF 1.3, as stated in the announcement.
>
> Ditto. With the improvements, plus avoiding the need for patches,
> it's the way to go. Even if it doesn't completely solve your problem,
> for most other situations it does an excellent job. I'll update the
> FAQ and parse_doc.pl comments to include the new version number and
> -raw option.
>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig@htdig.org containing the single word unsubscribe in
> the SUBJECT of the message.
>




This archive was generated by hypermail 2.0b3 on Thu Aug 12 1999 - 10:35:16 PDT