Re: [htdig] alterations

Subject: Re: [htdig] alterations
From: Gilles Detillieux
Date: Fri Nov 26 1999 - 09:42:49 PST

According to David Adams:
> Well, since you ask, I noticed two problems with PDF files on our site:
> 1. the titles were often meaningless, having no connection with
> the contents.
> 2. pdftotext outputs some spurious non-ascii gibberish that is
> then indexed.
> I modified the code which outputs the title to always include the
> type, and to put any extracted title in double quotes or the filename
> in square brackets:
> To throw away the spurious "words" I simplified the code to replace
> all non-alphanumerics with spaces. I appreciate that many people would
> think that too drastic:
> The spurious output is no longer indexed, but it does remain in the head,
> so there is further room for improvement.
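
For readers who want a concrete picture of the two changes David describes,
here is a minimal sketch in Python. It is not David's code: the function
names, the exact title layout, and the ASCII-only definition of
"alphanumeric" are all assumptions made for illustration.

```python
import re


def format_title(doc_type, extracted_title=None, filename=""):
    """Build a title that always starts with the document type.

    Hypothetical layout: an extracted title goes in double quotes,
    otherwise the filename goes in square brackets.
    """
    if extracted_title and extracted_title.strip():
        return f'{doc_type} "{extracted_title.strip()}"'
    return f"{doc_type} [{filename}]"


def strip_non_alphanumerics(text):
    """Replace every character that is not an ASCII letter or digit
    with a single space, so stray bytes never form indexable words."""
    return re.sub(r"[^A-Za-z0-9]", " ", text)


if __name__ == "__main__":
    print(format_title("PDF Document", "Annual Report", "report.pdf"))
    print(format_title("PDF Document", None, "report.pdf"))
    print(strip_non_alphanumerics("foo\x8fbar, baz!").split())
```

As David notes, this is drastic: hyphenated words, accented letters and
punctuation inside words are all lost, so a gentler filter may suit other
sites better.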

Thanks, David. Those changes seem pretty specific to your own needs, so
I won't bother incorporating them into future releases. They're
in the mail archives now, though, in case anyone wants to refer to them.

The days of this script are probably numbered anyway, what with the new
external converters code going into htdig. I can see it being
replaced by a more general document to text or html converter, so you don't
have to worry about the finicky details of the actual text parsing. By
passing the text back to the internal parsers, you'll have more consistent
parsing of documents, and likely fewer spurious words going into the index.

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


This archive was generated by hypermail 2b25 : Fri Nov 26 1999 - 09:54:43 PST