Re: [htdig] parse_doc.pl alterations


Subject: Re: [htdig] parse_doc.pl alterations
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Nov 26 1999 - 09:42:49 PST


According to David Adams:
> Well, since you ask, I noticed two problems with PDF files on our site:
>
> 1. the titles were often meaningless, having no connection with
> the contents.
>
> 2. pdftotext outputs some spurious non-ascii gibberish that is
> then indexed.
>
> I modified the code which outputs the title to always include the
> type, and to put any extracted title in double quotes or the filename
> in square brackets:
...
> To throw away the spurious "words" I simplified the code to replace
> all non-alphanumerics with spaces. I appreciate that many people would
> think that too drastic:
...
> The spurious output is no longer indexed, but it does remain in the head,
> so there is further room for improvement.

Thanks, David. Those changes seem pretty specific to your own needs, so
I won't bother incorporating them into future parse_doc.pl releases. They're
in the mail archives now, though, in case anyone wants to refer to them.
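
For anyone skimming the archive, the two changes described above amount to
roughly the following. This is only a sketch in Python, not David's actual
Perl (which is elided above); the function names, the ASCII-only definition
of "alphanumeric", and the exact title layout are all assumptions made for
illustration.

```python
import re


def clean_words(text):
    """Replace each non-alphanumeric character with a space, then split.

    Assumption: "alphanumeric" means ASCII letters and digits only, so
    accented characters and punctuation are discarded as well.
    """
    return re.sub(r"[^A-Za-z0-9]", " ", text).split()


def format_title(doc_type, title, filename):
    """Always show the type; quote a real title, else bracket the filename.

    The layout (type, a space, then the title or filename) is a guess.
    """
    if title:
        return f'{doc_type} "{title}"'
    return f"{doc_type} [{filename}]"


if __name__ == "__main__":
    print(clean_words("PDF \x8f\x90 text, v1.3!"))
    print(format_title("PDF", "Annual Report", "report.pdf"))
    print(format_title("PDF", "", "report.pdf"))
```

Note that "v1.3" comes out as the two words "v1" and "3", which is the sort
of side effect that makes this approach look drastic.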

The days of this script are probably numbered anyway, what with the new
external converters code going into htdig. I can see parse_doc.pl being
replaced by a more general document-to-text or document-to-HTML converter,
so you don't have to worry about the finicky details of the actual text
parsing. By passing the text back to the internal parsers, you'll get more
consistent parsing of documents, and likely fewer spurious words going into
the database.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
