Re: [htdig] parse_doc.pl slow


Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Thu, 12 Aug 1999 08:51:27 -0500 (EST)


On Tue, 20 Jul 1999, Gilles Detillieux wrote:

>
> According to me:
> > According to Frank Guangxin Liu:
> > > This afternoon, I noticed htdig didn't do anything except
> > > run parse_doc.pl on a PDF file. The file is about
> > > 700k, ~80 pages of text. I tried running pdftotext on this
> > > file and it took about a minute to produce a 6M text file.
> > > Both xpdf and acroread can open this file almost immediately.
> > > I am wondering why it took parse_doc.pl the whole afternoon
> > > to parse this one file. "top" shows it uses 90% of CPU.
> > > Is there anything we can do to speed up "parse_doc.pl"?
> > > If any of you want to reproduce this, I can send you
> > > the pdf file.
> > > After this file, I kept checking how htdig runs, and it seems
> > > to me that parse_doc.pl almost always takes more than an hour
> > > to parse a PDF file. This really is unacceptable.
> > >
> > > By the way, I switched from acroread to parse_doc.pl
> > > this weekend after reading the FAQ.
> >
> > parse_doc.pl is an interpreted Perl script, so it's not going to
> > be super efficient. However, more than one hour to parse an 80 page
> > document seems unusually long. I don't have PDFs that large, but
> > on my system a 2 page PDF gets parsed in under a second. I have a 200
> > MHz AMD-K6 with 64 MB RAM, running Linux kernel 2.0.36 and Perl 5.004.
> > How does that compare to what you have? Have you noticed any difference
> > if you run parse_doc.pl directly on one of these PDFs, instead of running
> > it from htdig? If you let me know where I could fetch a copy of this PDF,
> > I'll try it out on my system.
>
> Frank & I continued this discussion off the list, but for the benefit
> of those who are following (and for the archives), I thought I'd post
> a summary.
>
> It turns out the problem was caused by some pages in the PDF that
> contained tables in landscape orientation. These caused major confusion
> for pdftotext, leading it to put out hundreds of very long lines (~7KB),
> which slowed the perl script parse_doc.pl to a crawl. Adding the
> -rawdump option to pdftotext (which requires a patch, available at

I see that the new xpdf 0.90 is out. Has anybody tried it? Any
improvement? Does it include the patches (deltax, rawdump) from
the htdig ftp site?

Thanks!
Frank

> http://www.htdig.org/files/contrib/parsers/) sped things up considerably
> (from 1.5 hrs to 22 sec on my system), but pdftotext still isn't putting
> out intelligible text for these landscape pages. I recommended to Frank
> that he notify Derek Noonburg, author of pdftotext and the xpdf package,
> to let him know of the problem. It remains to be seen whether htdig's
> parsing of acroread's PostScript output would do a better job of indexing
> these particular documents.
>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig@htdig.org containing the single word "unsubscribe" in
> the SUBJECT of the message.
>
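
For anyone who wants to check whether their own PDFs trigger the same
problem, the diagnosis in the summary above (hundreds of very long lines
in the pdftotext output) can be tested before handing the text to a
parser. The following is only a minimal sketch, not part of htdig or
xpdf: the function name long_line_report and the 1000-character
threshold are assumptions made for illustration. It reads a text file
that has already been produced by pdftotext and lists the lines that
exceed the threshold.

```python
import sys


def long_line_report(lines, threshold=1000):
    """Return (total_lines, offenders) for an iterable of text lines.

    offenders is a list of (line_number, length) pairs for every line
    longer than `threshold` characters, not counting the newline.
    The threshold is an arbitrary illustrative value.
    """
    total = 0
    offenders = []
    for number, line in enumerate(lines, start=1):
        total = number
        length = len(line.rstrip("\n"))
        if length > threshold:
            offenders.append((number, length))
    return total, offenders


# Usage (hypothetical file name): python long_lines.py output.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as fh:
        total, offenders = long_line_report(fh)
    print(f"{total} lines, {len(offenders)} over threshold")
    for number, length in offenders:
        print(f"line {number}: {length} characters")
```

If the report shows many lines of several kilobytes each, the document
is probably hitting the landscape-table case described above.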



This archive was generated by hypermail 2.0b3 on Thu Aug 12 1999 - 07:01:10 PDT