Re: [htdig] doc2html hangs while parsing PDFs


Subject: Re: [htdig] doc2html hangs while parsing PDFs
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Wed Jan 03 2001 - 07:09:05 PST


On Wed, 03 Jan 2001 15:59:57 +0100 Berthold Cogel
<cogel@rrz.uni-koeln.de> wrote:

> Hello!
>
> I just tried to index our site with htdig-3.1.5 on a Sun UltraSparc with
> SunOS 5.7.
> To parse PDF documents I used doc2html and pdftotext. My first mistake
> was to leave max_doc_size at the default value. But I don't think that
> this was the reason for my problem:
>
> Sometimes doc2html hangs and eats resources and produces an unknown child
> process with <defunct> signature in the top list (perhaps pdftotext?).
>

There is a known bug in the hyphenation code in doc2html.pl
which causes it to loop indefinitely when the text of a PDF
file ends with a hyphen (optionally followed by white space).
This seems unlikely, but I have seen it.

In sub try_text change:

      while (<CAT>) {
        while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
          ($_ .= <CAT>) || last;
          s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        s/\255/-/g; # replace dashes with hyphens

To:

     while (<CAT>) {
       while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
         $_ .= <CAT>;
         last if eof;
         s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
       }
       s/\255/-/g; # replace dashes with hyphens
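
Why the original loops: at end of file <CAT> returns undef, so
"$_ .= <CAT>" leaves $_ unchanged. $_ is still a non-empty (true)
string, "|| last" never fires, and the pattern keeps matching.

The following is a minimal standalone sketch of the same idea, not
code from doc2html.pl. The sub name dehyphenate and the in-memory
test text are assumptions made for illustration. It tests whether the
read returned anything instead of calling eof, so a word that
continues onto the final line of the file is still joined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of doc2html.pl): joins words that are
# hyphenated across line breaks, reading from any filehandle.
sub dehyphenate {
    my ($fh) = @_;
    my $out = '';
    while (defined(my $line = <$fh>)) {
        # Line ends with letter + hyphen: try to pull in the next line.
        while ($line =~ m/[A-Za-z\300-\377]-\s*$/) {
            my $next = <$fh>;
            last unless defined $next;   # end of file: nothing to join
            $line .= $next;
            $line =~ s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        $out .= $line;
    }
    return $out;
}

# In-memory "file" whose last line ends with a hyphen, the case that
# made the original loop spin forever.
my $text = "hyphen-\nated word\ntrailing-\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
print dehyphenate($fh);
close($fh);
```

The first two lines are joined into "hyphenated word" and the final
"trailing-" line is passed through unchanged instead of hanging.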

> I don't think that the document size is a reason for this effect,
> because some of the files that caused the trouble (last line in
> htdig.log) had a size of only 10 to 40 KByte. Some bigger files (up to
> 34 MByte) didn't stop doc2html.
>
> By the way: Where do I have to set $Verbose?

In sub init:

sub init {

  # set = 1 for O/P on stderr if successful
  $Verbose = 1;

> Is it possible to write the
> messages of pdftotext and doc2html in a separate logfile?
>

Perhaps in the next version of doc2html.
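
Until then, one possible workaround is to reopen STDERR on a file
near the start of the script. This is only a sketch: the log path
below is a made-up example, and it assumes the messages of interest
are written to stderr. Child processes started afterwards inherit
the redirected stderr, so their error output should land in the same
file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical log path; choose a file the indexing user can write to.
my $logfile = '/tmp/doc2html.log';

# Reopen STDERR in append mode so messages accumulate across runs.
open(STDERR, '>>', $logfile) or die "cannot open $logfile: $!";

# From here on, anything written to STDERR goes to the logfile.
print STDERR "doc2html: example message\n";

# A child process inherits the redirected stderr as well.
system('sh', '-c', 'echo "child message" >&2');
```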

> Why doesn't htdig/doc2html take the complete document for parsing? You
> only have to take max_doc_size into account when you take the parsed
> documents for indexing. This might reduce the problems with doctypes
> other than html or plain text.

max_doc_size affects all documents fetched by htdig. It is
a safety device to prevent the downloading of extremely
large (or infinitely long!) documents.

>
> Thanks in advance
>
> Berthold Cogel
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig-unsubscribe@htdig.org
> You will receive a message to confirm this.
> List archives: <http://www.htdig.org/mail/menu.html>
> FAQ: <http://www.htdig.org/FAQ.html>
>

----------------------
David Adams
D.J.Adams@soton.ac.uk



This archive was generated by hypermail 2b28 : Wed Jan 03 2001 - 07:20:49 PST