[htdig] (?) Re: htdig: Question: French words

Serge Rossi (Serge.Rossi@renault.com)
Tue, 26 Jan 1999 09:01:41 -0800

Glen Newton wrote:

> Couldn't there be a change which would do the following:
> 1 - have a reference file which had "mappings", i.e.
> è = e
> é = e
> è = e
> 2 - whenever a single word like "polymères" was encountered,
> the word would be remapped into a new word using the
> mapping table, and then both words would be indexed,
> i.e. "polymères" and "polymeres". Then when a person typed
> in "polymere" they would get a hit for "polymères"...
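
For what it's worth, the mapping idea quoted above could be sketched roughly like this in Python. This is only an illustration, not htdig code: the names ACCENT_MAP and index_forms are made up for the example, and the table lists just a handful of letters.

```python
# Hypothetical mapping table; a real one would list every accented
# letter that can occur in the indexed documents.
ACCENT_MAP = str.maketrans({
    "è": "e", "é": "e", "ê": "e",
    "à": "a", "ç": "c", "ù": "u",
})


def index_forms(word):
    """Return the forms to index: the word itself plus, if it differs,
    the variant with accents mapped to plain letters."""
    plain = word.translate(ACCENT_MAP)
    if plain == word:
        return [word]
    return [word, plain]


print(index_forms("polymères"))
```

Note that this only removes accents. Whether a query for the singular "polymere" then hits "polymères" would still depend on whatever endings or stemming matching the search engine applies.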

There is a similar problem when indexing a lot of PDF files, with
an additional twist: the accented character codes differ depending on
whether the PDF file was generated on a Windows PC or on a Mac!

My solution: a little script that replaces accented characters with their
plain ASCII equivalents (the script is called from PDF.cc in htdig). It
filters the output of acroread through sed with a loooong list of commands.

Here is the list of commands, which did a fairly good job for PDF files
containing French accents generated on PC or Mac:

sed -e 's/\\21[01234]/a/g;s/\\21[67]/e/g;s/\\22[01]/e/g;s/\\22[345]/i/g;
s/\\23[0123]/o/g;s/\\227/o/g;s/\\23[4567]/u/g;s/\\215/c/g;s/\\226/n/g; \
s/\\37[1234]/u/g;s/\\347/c/g;s/\\30[012345]/A/g;s/\\31[0123]/E/g; \

Quite ugly but it works :-)
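
For anyone who prefers a table to a one-liner, the same substitutions can be sketched in Python. This assumes, as the sed patterns imply, that the acroread output writes each accented character as a backslash followed by three octal digits. The rules are copied only from the commands visible above, so the table is no more complete than that list; RULES and strip_accents are names invented for this example.

```python
import re

# (pattern, replacement) pairs taken from the sed commands above.
# The 2xx codes appear to be Mac-generated characters and the 3xx codes
# Windows-generated ones; that reading is an assumption.
RULES = [
    (r"\\21[01234]", "a"), (r"\\21[67]", "e"), (r"\\22[01]", "e"),
    (r"\\22[345]", "i"), (r"\\23[0123]", "o"), (r"\\227", "o"),
    (r"\\23[4567]", "u"), (r"\\215", "c"), (r"\\226", "n"),
    (r"\\37[1234]", "u"), (r"\\347", "c"), (r"\\30[012345]", "A"),
    (r"\\31[0123]", "E"),
]


def strip_accents(text):
    """Replace each octal escape covered by RULES with a plain ASCII letter."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(strip_accents(r"polym\217res"))
```

Escapes that no rule covers are left untouched, so missing codes stay easy to spot in the output.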

    Serge Rossi Tél. : 01 34 55 91 14
    Créos - Groupe Renault Fax : 01 34 55 96 76

This archive was generated by hypermail 2.0b3 on Sun Jan 31 1999 - 10:43:20 PST