Subject: Re: [htdig] infinite loop in doc2html.pl
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Wed Sep 20 2000 - 01:08:48 PDT
>
> Hello,
>
> I ran into an infinite loop using doc2html. When it parses a PDF document it tries to reassemble hyphenated words. Unfortunately, I have documents that end with a dash, like "text-", so the loop spins forever looking for the other half of the word. Adding a check for eof fixed it.
>
> in sub try_text()
>
> while (<CAT>) {
>     while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
>         ($_ .= <CAT>) || last;
>         s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
>     }
> --
> while (<CAT>) {
>     while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
>         ($_ .= <CAT>) || last;
>         s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
> +       last if eof;
>     }
>
>
> Terry Luedtke
> National Library of Medicine
>
This bug fix arrived too late to go into version 2.1 of doc2html.pl,
which is now available from the External Parsers section of
http://www.htdig.org/contrib/

Version 2.1 uses both the magic number and the MIME type to decide
which conversion utility to use, and is able to cope with:
MS Word (most versions including Word2 and Word for MAC)
MS Excel
MS PowerPoint
WordPerfect (purchase of wp2html necessary)
Adobe PDF
PostScript
RTF
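
To illustrate how a magic number and a MIME type can complement each other, here is a minimal sketch. It is not the code from doc2html.pl: the function name guess_type and the type labels it returns are hypothetical, and only a handful of well-known file signatures are checked. The point it shows is that Word, Excel and PowerPoint files can share one container signature, so the MIME type is what tells them apart.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT taken from doc2html.pl: guess a document type
# from the first bytes of the file, falling back on the MIME type.
sub guess_type {
    my ($magic, $mime) = @_;
    $mime = lc($mime // '');

    return 'pdf'         if $magic =~ /^%PDF-/;
    return 'postscript'  if $magic =~ /^%!/;
    return 'rtf'         if $magic =~ /^\{\\rtf/;
    return 'wordperfect' if $magic =~ /^\xFFWPC/;

    # OLE2 compound file: Word, Excel and PowerPoint share this signature,
    # so the MIME type is needed to tell them apart.
    if ($magic =~ /^\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1/) {
        return 'excel'      if $mime =~ /excel/;
        return 'powerpoint' if $mime =~ /powerpoint/;
        return 'word';      # assumed default for this sketch
    }

    # No recognised magic number: rely on the MIME type alone.
    return 'pdf'        if $mime eq 'application/pdf';
    return 'postscript' if $mime eq 'application/postscript';
    return 'rtf'        if $mime =~ m{^(?:application|text)/rtf$};
    return 'word'       if $mime eq 'application/msword';
    return;             # unknown type
}
```

A caller would read the first few bytes of the file in binary mode and pass them in together with the content type reported by the server, for example guess_type($first_bytes, 'application/vnd.ms-excel').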
There are a number of minor improvements, including a useful improvement
in the conversion of PDF files.
As for the future, the hyphenation code is nearly unchanged from
parsedoc.pl and clearly needs revision. This is not something I am
going to be able to spend much time on in the next few months, so if
someone would volunteer to take over code development I would be very
pleased to hand it on to them.
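
For anyone picking this up, one possible shape for a revised hyphenation loop is sketched below. It is only a sketch under stated assumptions, not the code in doc2html.pl: the function name dehyphenate is hypothetical, it takes a filehandle instead of reading from CAT, and it leaves out the $set->{'hyph'} switch. The idea is to test the result of the read itself, so that reaching end of file ends the inner loop even when the last line ends in a dash.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the code in doc2html.pl: read lines from a
# filehandle and rejoin words that were hyphenated across a line break.
sub dehyphenate {
    my ($fh) = @_;
    my @out;
    while (defined(my $line = <$fh>)) {
        # A line ending in letter + "-" may continue on the next line.
        while ($line =~ /[A-Za-z\300-\377]-\s*$/) {
            my $next = <$fh>;
            last unless defined $next;   # end of file: keep the trailing "-"
            $line .= $next;
            # Join the halves only when the next line starts with a letter.
            $line =~ s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/;
        }
        push @out, $line;
    }
    return join '', @out;
}
```

Each pass of the inner loop either consumes another line or stops at end of file, so it cannot spin on a document whose final line ends with "text-". It can be tried without a real file by opening an in-memory filehandle on a string, e.g. open(my $fh, '<', \$text).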
--
David J Adams <D.J.Adams@soton.ac.uk>
Computing Services
University of Southampton

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-unsubscribe@htdig.org
You will receive a message to confirm this.
List archives: <http://www.htdig.org/mail/menu.html>
FAQ: <http://www.htdig.org/FAQ.html>