[htdig] parse_doc.pl slow

Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Mon, 19 Jul 1999 20:33:13 -0500 (EST)

This afternoon I noticed that htdig was doing nothing except
running parse_doc.pl on a single PDF file. The file is about
700k, ~80 pages of text. I tried running pdftotext on this
file, and it took about a minute to produce a 6M text file.
Both xpdf and acroread can open this file almost immediately.
I am wondering why it took parse_doc.pl the whole afternoon
to parse this one file. "top" shows it using 90% of the CPU.
Is there anything we can do to speed up parse_doc.pl?
If any of you want to reproduce this, I can send you
the PDF file.
Since then I have kept checking how htdig runs, and it seems
that parse_doc.pl almost always takes more than an hour to
parse a PDF file. This really is unacceptable.
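
One way to narrow this down might be to time the conversion step
on its own and compare it with a full parse_doc.pl run on the same
file. The following is only a rough Python sketch: time_command is
a hypothetical helper, and the file names in the usage comment are
placeholders, not the actual file discussed above.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command to completion; return (elapsed_seconds, exit_code).

    Output is discarded so that terminal I/O does not skew the timing.
    """
    start = time.perf_counter()
    result = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, result.returncode


# Hypothetical usage (placeholder file names):
#   elapsed, code = time_command(["pdftotext", "sample.pdf", "sample.txt"])
#   print(f"pdftotext: {elapsed:.1f}s, exit code {code}")
```

If the converter alone finishes in about a minute, as observed above,
the remaining time is presumably spent in the script's own handling of
the converted text rather than in the PDF conversion itself.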

By the way, I switched from acroread to parse_doc.pl
this weekend after reading the FAQ.


