Re: [htdig] External converters - two questions


Subject: Re: [htdig] External converters - two questions
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Jan 10 2001 - 17:12:47 PST


According to David Adams:
> I hope to find time for a further revision of the external converter script
> doc2html.pl and possibly simplify it a little.
>
> The existing code includes de-hyphenation (which is buggy) taken originally
> from parsedoc.pl. The question is:
> is this necessary, does pdftotext (or any other utility) actually break up
> words across lines with the addition of hyphens? Is the hyphenation code of
> any use? Information and opinions are requested.

I added this code for dealing with a lot of the PDFs I needed to index
on my site, and for the Manitoba Unix User Group web site as well (for their
newsletters). Unlike HTML documents, a lot of PDF files, I've found, make
pretty heavy use of hyphenation. Without the dehyphenation code, the two
halves of a hyphenated word appeared as separate words in the resulting
text. E.g. "conv-" at the end of one line and "erter" at the start of the
next were taken as "conv" and "erter", so a search for "converter" would
not turn up this document unless the word also appeared unbroken elsewhere
in the document.
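
To make the idea concrete, here is a minimal sketch of end-of-line
dehyphenation. It is written in Python purely for illustration and is not
the Perl code from parse_doc.pl or doc2html.pl; the function name and the
exact pattern are assumptions of this sketch.

```python
import re

# Hypothetical pattern: a letter, a hyphen at end of line, then a
# lowercase letter opening the next line. Only the hyphen, the line
# break, and surrounding blanks are removed.
_EOL_HYPHEN = re.compile(r"(?<=[A-Za-z])-[ \t]*\r?\n[ \t]*(?=[a-z])")

def dehyphenate(text: str) -> str:
    """Rejoin words that were split across lines with a hyphen."""
    # A hyphen at the very end of the input has no following line,
    # so it does not match and is left untouched.
    return _EOL_HYPHEN.sub("", text)
```

A simple rule like this cannot tell a line-break hyphen from a real one, so
a compound such as "well-known" split at its hyphen would come out as
"wellknown". That seems an acceptable trade-off for indexing, but it is a
judgment call.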

Sorry about the EOF bug in this code. It was a quick hack, and I don't
know Perl all that well. There was a patch to fix this, though. Are there
any other bugs?

In any case, in parse_doc.pl and conv_doc.pl, I wrote it to be optional,
enabled by this line:

    $dehyphenate = 1; # PDFs often have hyphenated lines

which applies only to PDFs. The ps2ascii utility already does its own
dehyphenation, but pdftotext doesn't. Other document types are less
likely to need this. If dehyphenation of PDFs is not desired, it's easy
enough to change the 1 to a 0 above when configuring the script. I don't
recall if your doc2html.pl has the same sort of option.

> Also inherited from parsedoc.pl is extra code to cope with files which may
> be an "HP Print job" or contain a "MacBinary header". Are such files really
> encountered? If so what type of files are they, Word, PDF or what?
> Does the magic number code need to take account of them?

Another hack of mine. The MUUG web site had some pretty odd-ball
PostScript files on it that were causing error messages while indexing
their site. Instead of plain PostScript, some of these files had a
MacBinary wrapper or HP PJL codes in them. ps2ascii would happily
skip over these, but the Perl code wasn't accepting the files. These hacks
were to allow such files through. Dunno if anyone else has found they
help or hurt them, but I'm keeping them in my own copies of the scripts.
I know they're kind of ugly, so if you want to get rid of them in your
code for the sake of simplicity, I'd certainly understand.
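
For anyone wondering what such a check involves, here is a rough sketch of
a magic-number test that tolerates both kinds of wrapper. It is in Python
for illustration only and is not the code from the Perl scripts. It assumes
a MacBinary header is a 128-byte block whose first byte is zero and whose
second byte is a filename length from 1 to 63, that an HP print job opens
with the PJL "universal exit" sequence followed by "@PJL" lines, and that
PostScript proper starts with "%!". The function names are made up.

```python
UEL = b"\x1b%-12345X"  # escape sequence that opens an HP PJL job

def strip_wrappers(head: bytes) -> bytes:
    """Drop an assumed MacBinary header and/or PJL prefix from the
    first bytes of a file (pass in the first kilobyte or so)."""
    # Assumed MacBinary layout: 128-byte header, byte 0 is zero,
    # byte 1 is the filename length (1..63).
    if len(head) >= 128 and head[0] == 0 and 1 <= head[1] <= 63:
        head = head[128:]
    if head.startswith(UEL):
        head = head[len(UEL):]
        # Skip the "@PJL ..." command lines that precede the job data.
        while head.startswith(b"@PJL"):
            newline = head.find(b"\n")
            if newline == -1:
                return b""  # ran out of data while still inside PJL
            head = head[newline + 1:]
    return head

def looks_like_postscript(head: bytes) -> bool:
    """True if the data, minus any wrapper, starts with '%!'."""
    return strip_wrappers(head).startswith(b"%!")
```

If the extra cases are dropped for simplicity, only the final "%!" test
remains, and wrapped files like the ones described above would be rejected
again.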

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



