[htdig] (?) Re: htdig: Question: French words


Serge Rossi (Serge.Rossi@renault.com)
Tue, 26 Jan 1999 01:04:33 -0800


Glen Newton wrote:

> Couldn't there be a change which would do the following:
>
> 1 - have a reference file which had "mappings", i.e.
> è = e
> é = e
> è = e
> 2 - whenever a single word like "polymères" was encountered,
> the word would be remapped into a new word using the
> mapping table, and then both words would be indexed,
> i.e. "polymères" and "polymeres". Then when a person typed
> in "polymere" they would get a hit for "polymères"...
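
For illustration, the quoted proposal could be sketched roughly as below. This is a minimal Python sketch, not htdig code: the table contents and the names ACCENT_MAP and index_forms are hypothetical, and a real mapping file would cover many more characters.

```python
# Hypothetical mapping table standing in for the proposed "reference file".
ACCENT_MAP = str.maketrans({"è": "e", "é": "e", "ê": "e", "à": "a", "ç": "c"})


def index_forms(word):
    """Return the forms to index: the word itself plus its remapped variant."""
    plain = word.translate(ACCENT_MAP)
    # Only add the second form when the mapping actually changed something.
    return [word] if plain == word else [word, plain]


print(index_forms("polymères"))
```

Indexing both forms keeps exact accented matches working while also letting an unaccented query find the word.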

There is a similar problem when indexing a lot of PDF files, with
an additional twist: the accented character codes differ depending on
whether the PDF file was generated on a Windows PC or on a Mac!
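
To make the PC/Mac difference concrete, here is a small Python sketch that prints the octal byte code of a few French characters in two encodings. It assumes the PC files use Latin-1 and the Mac files use Mac Roman, which is an assumption on my part (the post does not name the encodings); octal_code is a hypothetical helper.

```python
def octal_code(char, encoding):
    """Return the 3-digit octal code of a character's single byte."""
    (byte,) = char.encode(encoding)  # fails loudly if not exactly one byte
    return format(byte, "03o")


# Assumed encodings: Latin-1 for the PC side, Mac Roman for the Mac side.
for ch in "éèàç":
    print(ch, octal_code(ch, "latin-1"), octal_code(ch, "mac_roman"))
```

Under that assumption the same letter gets two unrelated codes, so any filter has to list both.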

My solution: a little script (called from PDF.cc) which, at indexing
time, filters the output of acroread through sed with a loooong list
of substitutions.

Here is the list of commands, which did a fairly good job on PDF files
containing French accents, whether generated on a PC or on a Mac:

sed -e 's/\\21[01234]/a/g;s/\\21[67]/e/g;s/\\22[01]/e/g;s/\\22[345]/i/g; \
s/\\23[0123]/o/g;s/\\227/o/g;s/\\23[4567]/u/g;s/\\215/c/g;s/\\226/n/g; \
s/\\34[0123456]/a/g;s/\\35[0123]/e/g;s/\\35[4567]/i/g;s/\\36[23456]/o/g; \
s/\\37[1234]/u/g;s/\\347/c/g;s/\\30[012345]/A/g;s/\\31[0123]/E/g; \
s/\\31[4567]/I/g;s/\\32[23456]/O/g;s/\\33[1234]/U/g;s/\\307/C/g'

Quite ugly but it works :-)
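
For readers who would rather not maintain the sed one-liner, the same substitutions can be sketched in Python. This is only an illustration: it assumes, as the sed patterns suggest, that the text contains octal escapes written as a literal backslash followed by three digits, and the names RULES and strip_accents are hypothetical.

```python
import re

# (octal code pattern, replacement) pairs copied from the sed command above.
RULES = [
    ("21[0-4]", "a"), ("21[67]", "e"), ("22[01]", "e"), ("22[3-5]", "i"),
    ("23[0-3]", "o"), ("227", "o"), ("23[4-7]", "u"), ("215", "c"),
    ("226", "n"), ("34[0-6]", "a"), ("35[0-3]", "e"), ("35[4-7]", "i"),
    ("36[2-6]", "o"), ("37[1-4]", "u"), ("347", "c"), ("30[0-5]", "A"),
    ("31[0-3]", "E"), ("31[4-7]", "I"), ("32[2-6]", "O"), ("33[1-4]", "U"),
    ("307", "C"),
]


def strip_accents(text):
    """Replace each listed octal escape with its unaccented letter."""
    for code, letter in RULES:
        # r"\\" matches the literal backslash that starts the escape.
        text = re.sub(r"\\" + code, letter, text)
    return text


print(strip_accents(r"polym\350res"))  # escape from the 3xx block
print(strip_accents(r"polym\217res"))  # escape from the 2xx block
```

The rules are applied one after another, in the same order as in the sed script.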

_____________________________________________________________________
    Serge Rossi Tél. : 01 34 55 91 14
    Créos - Groupe Renault Fax : 01 34 55 96 76
_____________________________________________________________________