Re: [htdig] infinite loop in doc2html.pl


Subject: Re: [htdig] infinite loop in doc2html.pl
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Wed Sep 20 2000 - 01:08:48 PDT


>
> Hello,
>
> I ran into an infinite loop using doc2html. When it parses a PDF document it tries to reassemble hyphenated words. Unfortunately, I have documents that end with a dash, like "text-", so the loop spins forever looking for the other half of the word. Adding a check for eof fixed it.
>
> in sub try_text()
>
> while (<CAT>) {
> while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
> ($_ .= <CAT>) || last;
> s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
> }
> --
> while (<CAT>) {
> while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
> ($_ .= <CAT>) || last;
> s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
> + last if eof;
> }
>
>
> Terry Luedtke
> National Library of Medicine
>
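
For anyone revisiting this loop, the following is a minimal sketch of
the same idea in Python rather than Perl; it is not the doc2html.pl
code. It assumes the text arrives as an iterable of lines, and the
name join_hyphenated is hypothetical. In the quoted Perl,
($_ .= <CAT>) still evaluates as true at end of file because $_ is
non-empty, so "|| last" never fires; the sketch instead stops the
inner loop as soon as the input is exhausted.

```python
import io
import re

# A line ending in letter + hyphen, optionally followed by whitespace.
HYPHEN_AT_EOL = re.compile(r"[A-Za-z\u00C0-\u00FF]-\s*$")
# Letter, hyphen, line break, letter: the two halves of a split word.
SPLIT_WORD = re.compile(
    r"([A-Za-z\u00C0-\u00FF])-\s*\n\s*([A-Za-z\u00C0-\u00FF])"
)


def join_hyphenated(lines):
    """Yield lines, rejoining words that were hyphenated across a line break."""
    it = iter(lines)
    for line in it:
        while HYPHEN_AT_EOL.search(line):
            nxt = next(it, None)
            if nxt is None:
                # End of input: there is no second half to join, so stop
                # here instead of looping forever.
                break
            line = SPLIT_WORD.sub(r"\1\2", line + nxt, count=1)
        yield line


if __name__ == "__main__":
    sample = io.StringIO("exam-\nple text\nends with text-")
    for out in join_hyphenated(sample):
        print(repr(out))
```

The final line of the sample ends in a dash with nothing after it,
which is the case that triggered the loop in the report above.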

This bug fix arrived too late to go into version 2.1 of doc2html.pl
which is now available from the External Parsers section of
http://www.htdig.org/contrib/

Version 2.1 uses both the magic number and the MIME type to decide
which conversion utility to use, and is able to cope with:

        MS Word (most versions including Word2 and Word for Mac)
        MS Excel
        MS PowerPoint
        WordPerfect (purchase of wp2html necessary)
        Adobe PDF
        PostScript
        RTF
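
To illustrate what combining a magic number with a MIME type can look
like, here is a rough sketch in Python. It is not how doc2html.pl is
written: the function name, the tables and the returned labels are all
hypothetical, and only the leading byte signatures are standard ones.
The point it shows is that Word, Excel and PowerPoint files share one
container signature, so the MIME type is needed to tell them apart.

```python
# Leading bytes ("magic numbers") for some of the formats listed above.
# The labels on the right are hypothetical names used only in this sketch.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"%!", "postscript"),
    (b"{\\rtf", "rtf"),
    (b"\xffWPC", "wordperfect"),
    # OLE2 compound file: shared by Word, Excel and PowerPoint documents.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),
]

# MIME types mapped onto the same hypothetical labels.
MIME_TYPES = {
    "application/pdf": "pdf",
    "application/postscript": "postscript",
    "application/rtf": "rtf",
    "application/msword": "word",
    "application/vnd.ms-excel": "excel",
    "application/vnd.ms-powerpoint": "powerpoint",
}

OLE2_FORMATS = {"word", "excel", "powerpoint"}


def guess_format(head, mime_type=""):
    """Guess a document format from its first bytes and its MIME type.

    Returns one of the labels above, or None if neither source helps.
    """
    mime_guess = MIME_TYPES.get(mime_type.lower())
    for magic, fmt in MAGIC_NUMBERS:
        if head.startswith(magic):
            if fmt == "ole2":
                # The signature alone cannot tell the Office formats
                # apart, so refine the guess with the MIME type.
                return mime_guess if mime_guess in OLE2_FORMATS else "ole2"
            return fmt
    # No known signature: fall back on the MIME type alone.
    return mime_guess


if __name__ == "__main__":
    print(guess_format(b"%PDF-1.3\n", "application/octet-stream"))
```

Checking the signature first means a wrongly labelled file is still
routed by its content, while the MIME type only breaks ties or fills
in when no signature matches.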

There are a number of minor improvements, including a useful one
to the conversion of PDF files.

As for the future, the hyphenation code is nearly unchanged from
parsedoc.pl and clearly needs revision. This is not something I am
going to be able to spend much time on in the next few months, so if
someone would volunteer to take over code development I would be very
pleased to hand it on to them.

-- 
 
David J Adams
<D.J.Adams@soton.ac.uk>
Computing Services
University of Southampton


This archive was generated by hypermail 2b28 : Wed Sep 20 2000 - 01:11:27 PDT