Re: [htdig] External converters - two questions

Subject: Re: [htdig] External converters - two questions
From: David Adams (
Date: Thu Jan 11 2001 - 07:38:23 PST

Thanks Gilles, that is useful information. I had thought that perhaps
pdftotext actually *added* hyphenation to a document. If the problem is
removing the hyphenation that is actually written into the document then I
can see that not everybody will wish to do this. It is easily switched off
in, but only if you know where to look. The next version will
definitely be better in this respect.
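
For illustration only, the kind of de-hyphenation under discussion can be sketched in a few lines. This is a minimal Python sketch, not the Perl code from the converter scripts; the function name `dehyphenate` and the exact matching rule (letter, hyphen, line break, lowercase letter) are assumptions made for the example.

```python
import re


def dehyphenate(text: str) -> str:
    """Join words that were split across a line break with a trailing hyphen.

    Hypothetical helper: a hyphen is removed only when it directly follows a
    letter, ends the line, and the next line starts (after optional
    indentation) with a lowercase letter.
    """
    return re.sub(r"(?<=[A-Za-z])-\n[ \t]*(?=[a-z])", "", text)


if __name__ == "__main__":
    sample = "an external conv-\nerter script"
    print(dehyphenate(sample))
```

A rule like this cannot tell a line-break hyphen from a genuine compound such as "de-hyphenation" that happens to break at its own hyphen, which is one reason to keep the step optional.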

As for magic numbers, I'll wait and see if anybody else can offer some
additional observations.

David Adams
Computing Services
Southampton University

----- Original Message -----
From: "Gilles Detillieux" <>
To: "David Adams" <>
Cc: <>
Sent: Thursday, January 11, 2001 1:12 AM
Subject: Re: [htdig] External converters - two questions

> According to David Adams:
> > I hope to find time for a further revision of the external converter script
> > and possibly simplify it a little.
> >
> > The existing code includes de-hyphenation (which is buggy) taken originally
> > from The question is:
> > is this necessary, does pdftotext (or any other utility) actually break up
> > words across lines with the addition of hyphens? Is the hyphenation code of
> > any use? Information and opinions are requested.
>
> I added this code for dealing with a lot of the PDFs I needed to index
> on my site, and for the Manitoba Unix User Group web site as well (for their
> newsletters). Unlike HTML documents, I've found a lot of PDF files make
> pretty heavy use of hyphenation. Without the dehyphenation code, hyphenated
> words appeared as two separate words in the resulting text. E.g. "conv-
> erter" was taken as "conv" and "erter", so a search for "converter" may
> not turn up this document if the word didn't appear unbroken elsewhere
> in the document.
>
> Sorry about the EOF bug in this code. It was a quick hack, and I don't
> know Perl all that well. There was a patch to fix this, though. Are there
> any other bugs?
>
> In any case, in and, I wrote it to be optional,
> enabled by this line:
>
> $dehyphenate = 1; # PDFs often have hyphenated lines
>
> which only applied to PDFs. The ps2ascii utility already does its own
> dehyphenation, but pdftotext doesn't. Other document types are less
> likely to need this. If dehyphenation of PDFs is not desired, it's easy
> enough to change the 1 to a 0 above when configuring the script. I don't
> recall if your has the same sort of option.
>
> > Also inherited from is extra code to cope with files which may
> > be an "HP Print job" or contain a "MacBinary header". Are such files really
> > encountered? If so what type of files are they, Word, PDF or what?
> > Does the magic number code need to take account of them?
>
> Another hack of mine. The MUUG web site had some pretty odd-ball
> PostScript files on it that were causing error messages while indexing
> their site. Instead of simple and pure PS in these files, some had a
> MacBinary wrapper or HP PJL codes in them, which ps2ascii happily would
> skip over, but the Perl code wasn't accepting these files. These hacks
> were to allow these files through. Dunno if anyone else has found they
> help or hurt them, but I'm keeping them in my own copies of the scripts.
> I know they're kind of ugly, so if you want to get rid of them in your
> code for the sake of simplicity, I'd certainly understand.
>
> --
> Gilles R. Detillieux              E-mail: <>
> Spinal Cord Research Centre       WWW:
> Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada)     Fax: (204)789-3930
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
>
> You will receive a message to confirm this.
> List archives: <>
> FAQ: <>
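
As a rough illustration of the wrapper handling described in the quoted reply, a magic-number check can skip a MacBinary header or leading HP PJL lines before looking for the PostScript or PDF signature. This is a minimal Python sketch, not the Perl code from the scripts, whose actual tests are not shown in this thread. The function name `sniff_type` is hypothetical, and the MacBinary test (128-byte header, zero first byte, filename length of 1 to 63 in the second byte) is a deliberately loose heuristic.

```python
# HP "Universal Exit Language" sequence that normally opens a PJL print job.
UEL = b"\x1b%-12345X"


def sniff_type(data: bytes) -> str:
    """Guess a document type from its leading bytes, tolerating two wrappers.

    Hypothetical helper: returns "pdf", "postscript" or "unknown".
    """
    # MacBinary wrapper: fixed 128-byte header in front of the real data.
    if len(data) > 128 and data[0] == 0 and 1 <= data[1] <= 63:
        data = data[128:]

    # HP PJL wrapper: UEL sequence followed by one or more "@PJL ..." lines.
    if data.startswith(UEL):
        data = data[len(UEL):]
        while data.startswith(b"@PJL"):
            newline = data.find(b"\n")
            if newline == -1:
                return "unknown"  # nothing but PJL commands in the sample
            data = data[newline + 1:]

    # Ordinary magic numbers for the two formats discussed in the thread.
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"%!"):
        return "postscript"
    return "unknown"


if __name__ == "__main__":
    job = UEL + b"@PJL JOB\r\n@PJL ENTER LANGUAGE=POSTSCRIPT\r\n%!PS-Adobe-3.0\n"
    print(sniff_type(job))
```

Only the first few hundred bytes of a file need to be read for a check like this, so it adds little cost to the converter even if such wrapped files turn out to be rare.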


This archive was generated by hypermail 2b28 : Thu Jan 11 2001 - 07:52:15 PST