[htdig] doc2html hangs while parsing PDFs


Subject: [htdig] doc2html hangs while parsing PDFs
From: Berthold Cogel (cogel@rrz.uni-koeln.de)
Date: Wed Jan 03 2001 - 06:59:57 PST


Hello!

I just tried to index our site with htdig-3.1.5 on a Sun UltraSparc with
SunOS 5.7.
To parse PDF documents I used doc2html and pdftotext. My first mistake
was to leave max_doc_size at the default value. But I don't think that
this was the reason for my problem:

Sometimes doc2html hangs, eats resources, and leaves an unknown child
process marked <defunct> in the top listing (perhaps pdftotext?).

I don't think that the document size is a reason for this effect,
because some of the files that caused the trouble (last line in
htdig.log) had a size of only 10 to 40 KByte. Some bigger files (up to
34 MByte) didn't stop doc2html.

By the way: Where do I have to set $Verbose? Is it possible to write the
messages of pdftotext and doc2html to a separate logfile?
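
To show the kind of behaviour I am after, here is a rough sketch in
Python. It is not the doc2html code (which is Perl); the function name
run_converter, the log file name and the 60 second limit are my own
assumptions. It runs an external converter, appends its error messages
to a separate logfile, and gives up after a time limit instead of
hanging. As far as I know, subprocess.run kills and reaps the child on
timeout, so no <defunct> process should remain.

```python
import subprocess
import sys


def run_converter(cmd, log_path, timeout_s=60):
    """Run an external converter command (hypothetical helper).

    The converter's stderr is appended to log_path. Returns the
    converter's stdout as text, or None if it timed out or exited
    with a non-zero status.
    """
    with open(log_path, "a") as log:
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,  # parsed text comes back to us
                stderr=log,              # messages go to the logfile
                timeout=timeout_s,       # child is killed after this
            )
        except subprocess.TimeoutExpired:
            log.write("timeout after %d s: %r\n" % (timeout_s, cmd))
            return None
        if result.returncode != 0:
            log.write("exit status %d: %r\n" % (result.returncode, cmd))
            return None
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Assumes pdftotext is installed; "-" sends the text to stdout.
    text = run_converter(["pdftotext", sys.argv[1], "-"], "converter.log")
    print("conversion failed" if text is None else text[:200])
```

With something like this, the last lines of converter.log would show
which file caused the timeout.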

Why doesn't htdig/doc2html take the complete document for parsing? It
should only be necessary to take max_doc_size into account when the
parsed text is handed over for indexing. This might reduce the problems
with document types other than HTML or plain text.
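
To make the suggestion concrete, here is a small sketch in Python. It
is not the real htdig or doc2html logic; toy_convert and the two
function names are made up for illustration. The toy converter refuses
input whose end is missing, which is how I would expect a PDF parser to
behave, since a PDF keeps its cross-reference table and %%EOF marker at
the end of the file.

```python
def toy_convert(raw):
    """Stand-in for a real converter: fails if the end marker is missing."""
    if not raw.rstrip().endswith(b"%%EOF"):
        raise ValueError("truncated input")
    return raw.replace(b"%%EOF", b"").strip()


def truncate_then_parse(raw, convert, max_doc_size):
    """Cut the raw file first, then convert (the behaviour I am asking about)."""
    return convert(raw[:max_doc_size])


def parse_then_truncate(raw, convert, max_doc_size):
    """Convert the complete file, then limit only the parsed text."""
    return convert(raw)[:max_doc_size]


if __name__ == "__main__":
    doc = b"some long document text " * 10 + b"%%EOF"
    print(parse_then_truncate(doc, toy_convert, 20))
    try:
        truncate_then_parse(doc, toy_convert, 20)
    except ValueError as err:
        print("converter failed:", err)
```

In the second variant the converter always sees a complete file, and
max_doc_size only limits how much text reaches the index.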

Thanks in advance

Berthold Cogel
