Philippe Riviere (Philippe.Riviere@Monde-diplomatique.fr)
Tue, 25 May 1999 11:41:53 +0200
Hi,
I need htdig/htsearch to handle accents in the following way:
"étude" and "etude" must be treated as the same word, both when retrieving
and when searching. I was not satisfied by the answer on the htdig users
list (too much overhead in general, since you have to recompile the words
database and use fuzzy matching), and I think the answer must lie somewhere
near the lowercase function (after all, why are 'E' and 'e' treated as equal
but not 'é'?).
So I patched htlib/String.cc in a quick-and-dirty way, to make sure it would
work. And indeed it works (you have to reindex the whole site, though). But
I'm not good enough at programming to finish this patch, let alone to know
for sure how it should be finished.
So here it is; your comments and extensions are more than welcome!
----------------------------------------------------------------------X
void String::lowercase()
{
    for (int i = 0; i < Length; i++)
    {
        // if (isupper(Data[i]))
        Data[i] = tolower((unsigned char)Data[i]);
        /////////////// START PATCH for 'é'
        // 233 is 'é' in ISO-8859-1, 101 is ASCII 'e'
        if ((unsigned char)Data[i] == 233) Data[i] = 101;
        /////////////// END PATCH
    }
}
---------------------------------------------------------------------X
As you can see, there will be numerous problems: where does one get the list
of conversions? What should be done when a conversion is not char to char
(e.g. ö -> oe)? And so on.
This should even make the databases a bit smaller...
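
To make the two open questions above more concrete, here is a rough sketch
of a table-driven version. It is only an illustration under assumptions:
the text is assumed to be ISO-8859-1, the conversion list is a small
hand-written sample rather than a complete or authoritative one, and the
names unaccent_latin1() and fold_word() are hypothetical, not part of
ht://Dig. Because a conversion such as ö -> oe changes the length, the
sketch builds a new std::string instead of rewriting Data[] in place.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not an ht://Dig function): returns the unaccented
// replacement for one lowercase ISO-8859-1 byte, or nullptr if the byte is
// not in this deliberately incomplete sample table.
static const char *unaccent_latin1(unsigned char c)
{
    switch (c) {
    case 0xE0: case 0xE1: case 0xE2: case 0xE3: case 0xE5:
        return "a";                       // a with grave, acute, circumflex, tilde, ring
    case 0xE4: return "ae";               // a with diaeresis (assumed German-style)
    case 0xE7: return "c";                // c with cedilla
    case 0xE8: case 0xE9: case 0xEA: case 0xEB:
        return "e";                       // e with grave, acute, circumflex, diaeresis
    case 0xEC: case 0xED: case 0xEE: case 0xEF:
        return "i";
    case 0xF1: return "n";                // n with tilde
    case 0xF2: case 0xF3: case 0xF4: case 0xF5:
        return "o";
    case 0xF6: return "oe";               // o with diaeresis, as in the example above
    case 0xF9: case 0xFA: case 0xFB:
        return "u";
    case 0xFC: return "ue";               // u with diaeresis (assumed German-style)
    case 0xDF: return "ss";               // sharp s
    default:   return nullptr;            // no conversion known
    }
}

// Hypothetical replacement for the lowercase step: lowercases a word and
// strips the accents listed in the table above.
std::string fold_word(const std::string &word)
{
    std::string out;
    out.reserve(word.size());
    for (std::string::size_type i = 0; i < word.size(); i++) {
        unsigned char c = static_cast<unsigned char>(word[i]);
        if (c < 0x80) {
            // Plain ASCII: ordinary lowercase.
            out += static_cast<char>(std::tolower(c));
            continue;
        }
        // In ISO-8859-1 the uppercase letters 0xC0..0xDE sit exactly 0x20
        // below their lowercase forms (0xD7 is the multiplication sign).
        if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
            c += 0x20;
        const char *plain = unaccent_latin1(c);
        if (plain)
            out += plain;                 // may append more than one char
        else
            out += static_cast<char>(c);  // unknown byte: keep it unchanged
    }
    return out;
}
```

With this sketch, fold_word("\xE9tude") and fold_word("Etude") both give
"etude". The same folding would have to be applied both when the index is
built and to the words typed into htsearch, otherwise the two sides would
no longer match. Which conversions are right (ö -> oe or ö -> o, for
instance) depends on the language, so the table would probably need to be
configurable.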
-- Philippe Riviere <Philippe.Riviere@monde-diplomatique.fr>
Le Monde diplomatique http://www.monde-diplomatique.fr/
21b, rue Claude-Bernard 75005 Paris tel: 33 1 42 17 37 46
Le Monde diplomatique in English: http://www.monde-diplomatique.fr/en/