[htdig] External converters - two questions

Subject: [htdig] External converters - two questions
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Tue Jan 09 2001 - 01:51:20 PST

I hope to find time for a further revision of the external converter script
doc2html.pl, and possibly to simplify it a little.

The existing code includes de-hyphenation (which is buggy), taken originally
from parsedoc.pl. The question is whether this is necessary: does pdftotext
(or any other utility) actually break words across lines by adding hyphens?
Is the hyphenation code of any use? Information and opinions are requested.
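
To make the question concrete, de-hyphenation here means roughly the
following. This is only a minimal sketch, written in Python rather than the
Perl of doc2html.pl, and it is not the code taken from parsedoc.pl; the name
dehyphenate and the exact pattern are assumptions made up for illustration.

```python
import re

# Hypothetical helper for illustration; NOT the doc2html.pl/parsedoc.pl code.
# Matches a letter, a hyphen at end of line, and a lower-case letter that
# starts the next line, e.g. "exter-\nnal".
_HYPHEN_BREAK = re.compile(r"([A-Za-z])-[ \t]*\r?\n[ \t]*([a-z])")


def dehyphenate(text: str) -> str:
    """Join words that were split across a line break with a hyphen."""
    # Keep the two letters and drop the hyphen, the newline and any padding.
    return _HYPHEN_BREAK.sub(r"\1\2", text)


if __name__ == "__main__":
    print(dehyphenate("the exter-\nnal converter"))
```

A rule this simple cannot tell an inserted hyphen from a real one: a genuine
compound such as "well-known" that happens to fall across a line end would be
joined as well, which is part of why it matters whether the converters
produce such breaks at all.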

Also inherited from parsedoc.pl is extra code to cope with files which may
be an "HP Print job" or contain a "MacBinary header". Are such files really
encountered? If so, what type of files are they: Word, PDF or something else?
Does the magic number code need to take account of them?
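
For anyone unsure what is being asked, the sketch below shows what "taking
account of them" might look like. It is in Python, not the Perl of
doc2html.pl, and every name in it is hypothetical. It assumes that an "HP
Print job" means a file wrapped in PJL commands beginning with ESC %-12345X,
and that a "MacBinary header" is a 128-byte block whose first byte is zero
and whose second byte is a filename length from 1 to 63. Neither assumption
has been checked against what parsedoc.pl actually does.

```python
# Illustrative sketch only; names are hypothetical and this is not the
# parsedoc.pl or doc2html.pl logic.
PJL_UEL = b"\x1b%-12345X"      # assumed start of an HP print job (PJL)
MACBINARY_HEADER_LEN = 128     # assumed size of a MacBinary header

SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!", "postscript"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # e.g. Word .doc
    (b"{\\rtf", "rtf"),
]


def match_signature(data: bytes):
    """Return a type name if data starts with a known magic number."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def sniff(data: bytes):
    """Identify the file type, looking past an assumed wrapper if needed."""
    kind = match_signature(data)
    if kind:
        return kind
    if data.startswith(PJL_UEL):
        body = data[len(PJL_UEL):]
        # Skip leading "@PJL ..." command lines to reach the payload.
        while body.startswith(b"@PJL"):
            newline = body.find(b"\n")
            if newline < 0:
                return None
            body = body[newline + 1:]
        return match_signature(body)
    # Assumed MacBinary test: zero byte, then a filename length of 1..63.
    if (len(data) > MACBINARY_HEADER_LEN
            and data[0] == 0 and 1 <= data[1] <= 63):
        return match_signature(data[MACBINARY_HEADER_LEN:])
    return None
```

If files like these never turn up in practice, the two wrapper branches in
sniff could simply be dropped, leaving only the plain signature table.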

David Adams
Computing Services
Southampton University

