[htdig] doc2html hangs while parsing PDFs

Subject: [htdig] doc2html hangs while parsing PDFs
From: Berthold Cogel (cogel@rrz.uni-koeln.de)
Date: Wed Jan 03 2001 - 06:59:57 PST


I just tried to index our site with htdig-3.1.5 on a Sun UltraSparc with
SunOS 5.7.
To parse PDF documents I used doc2html and pdftotext. My first mistake
was to leave max_doc_size at the default value. But I don't think that
this was the reason for my problem:

Sometimes doc2html hangs, eats resources, and leaves an unknown child
process marked <defunct> in the top listing (perhaps pdftotext?).

I don't think that the document size is a reason for this effect,
because some of the files that caused the trouble (last line in
htdig.log) had a size of only 10 to 40 KByte. Some bigger files (up to
34 MByte) didn't stop doc2html.
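
A minimal sketch of one way to keep a hanging converter from blocking
the whole indexing run: call it with a time limit and make sure the
child is reaped. This is not how doc2html itself works (doc2html is a
Perl script); the function name run_converter and the 60 second default
are assumptions made for illustration only.

```python
import subprocess


def run_converter(cmd, timeout_s=60):
    """Run an external converter; return its stdout, or None on timeout or error."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising,
        # so no <defunct> entry is left behind in the process table.
        return None
    if result.returncode != 0:
        # The converter ran to completion but reported a failure.
        return None
    return result.stdout
```

A call might look like run_converter(["pdftotext", "file.pdf", "-"]),
assuming the installed pdftotext writes to stdout when "-" is given as
the output file. A None result means the document should be skipped
rather than retried.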

By the way: Where do I have to set $Verbose? Is it possible to write the
messages of pdftotext and doc2html to a separate log file?
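
To illustrate what is meant by a separate log file, here is a minimal
sketch that sends a converter's error messages to their own file while
keeping the converted text on stdout. It does not answer where $Verbose
is set in doc2html; the function name run_and_log and the log path are
hypothetical.

```python
import subprocess


def run_and_log(cmd, log_path):
    """Run cmd, append its stderr to log_path, return (exit code, stdout bytes)."""
    with open(log_path, "ab") as log:
        # Write a header line first and flush it, so it appears in the file
        # before anything the child process writes to the same descriptor.
        log.write(("--- " + " ".join(cmd) + "\n").encode())
        log.flush()
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=log)
    return result.returncode, result.stdout
```

Because the file is opened in append mode, the messages from every
converted document end up in one file, each preceded by the command
that produced them.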

Why doesn't htdig/doc2html take the complete document for parsing?
max_doc_size would only have to be taken into account when the parsed
documents are indexed. This might reduce the problems with document
types other than HTML or plain text.

Thanks in advance

Berthold Cogel

