Re: [htdig] problems with accents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 20 May 1999 10:59:14 -0500 (CDT)


According to Philippe Riviere:
> >>1) some browsers don't like URLs containing accentuated letters (it would
> >>be better to have them escaped). This happens in the results page when your
> >>search of an accentuated word yields many results : the 1 2 3 4 5 next
> >>links contain accents
> >
> >It would certainly be better to not have accentuated letters in URLs
> >in general. IMHO this is more a matter of proper naming of document
> >files than of having search engines recognizing them. I'd bet you'll
> >go into trouble with that with more than just ht://Dig..
>
> True. But htsearch itself generates URLs pointing back to itself ("go to
> next page of results") and should not use accents in these.

Yes, the problem here is that the accented letters in the "words"
input parameter don't get URL-encoded when htsearch creates the URLs
for the links to other search results pages. Right now, the test in
encodeURL() is:

        if (isdigit(*p) || isalpha(*p) || strchr(valid, *p))

but since isalpha() can return true for accented letters once a locale
such as fr_FR is set, it should probably be:

        if (isascii(*p) &&
          (isdigit(*p) || isalpha(*p) || strchr(valid, *p)))

Anything that fails the test should be URL-encoded as %xx. There was
also some discussion about a month ago as to what constitutes a "valid"
character. It seems that what htsearch encodes is not the same set as
what Netscape Navigator encodes. Someone also quoted what the standard
should be, and it was another set altogether. I was going to change the
valid set in htsearch (it's still on my to-do list), but I've forgotten
what the standard set is, and where I can find it. Time to search the
archives, I guess.
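
To make the idea concrete, here is a minimal sketch of an encoding loop
with the ASCII check in place. It is not the actual htsearch encodeURL();
the function name encode_url_sketch and the buffer handling are made up
for illustration, and the "valid" set is an assumption: it uses the
unreserved "mark" characters from RFC 2396, which may or may not be the
set that was quoted on the list earlier. The bytes are read as unsigned
char so that accented letters are never passed to the ctype functions as
negative values.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the real htsearch encodeURL().
 * Copies `in` to `out`, replacing every byte that is not an ASCII
 * letter, digit, or member of `valid` with %XX. Output is truncated
 * (but always NUL-terminated) if `outsize` is too small. */
static void encode_url_sketch(const char *in, char *out, size_t outsize)
{
    /* Assumed set: the "mark" characters RFC 2396 lists as unreserved. */
    static const char valid[] = "-_.!~*'()";
    static const char hex[] = "0123456789ABCDEF";
    const unsigned char *p;
    size_t n = 0;

    if (outsize == 0)
        return;

    for (p = (const unsigned char *)in; *p; p++) {
        /* The < 0x80 test plays the role of isascii(). */
        if (*p < 0x80 && (isalnum(*p) || strchr(valid, *p))) {
            if (n + 1 >= outsize)
                break;
            out[n++] = (char)*p;
        } else {
            if (n + 3 >= outsize)
                break;
            out[n++] = '%';
            out[n++] = hex[*p >> 4];
            out[n++] = hex[*p & 0x0F];
        }
    }
    out[n] = '\0';
}
```

With ISO-8859-1 input, the byte 0xE9 (e-acute) fails the test and comes
out as %E9, so the links to other results pages stay pure ASCII.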

> >>2) searching "étude" does not yield "etudes" and vice-versa. I'd prefer
> >>it to.
> >>
> >
> >Look at ht://Dig documentation, set your locale to a proper value
> >(probably fr_FR), get a french dictionary and affix rule file for
> >the endings algorithm and re-index your site.
>
> locale is currently set to fr_FR ; is there something else to add ?

As Torsten pointed out, the endings algorithm will handle the matching
of singular and plural forms, but you must run "htfuzzy endings" to
build your endings database from the dictionary files.

The matching of accented and unaccented characters is another matter, and
is a problem that has been discussed here before, but never completely
resolved, as far as I can remember. Torsten's suggestion of using
the soundex or metaphone htfuzzy algorithms may be worth a try!
Those databases must be rebuilt every time you reindex your site.
If those work for you, then great. It may be, though, that either they
don't treat accented and unaccented letters as having the same sound, or
they lead to too many false matches if the sound matching is too vague.

A second option, which I think had been suggested previously, is to use
the synonyms algorithm. That would require adding pairs like "etude
étude" in the synonyms file, and running "htfuzzy synonyms". It would
be tedious to get all possible accented words in the file in this way,
but you could start with the most common ones and add them as needed
(by tracking searches in the log files).

A third option, which would require some code development, would be to
add a new "accents" fuzzy algorithm that would build up a database of
matching accented and unaccented letters - which is of course dependent
on both character set and language. For instance, it's my understanding
that in Swedish, "ö" and "o" are separate letters, which should not be
treated as equivalent, but in French you'd want them to be equivalent.
(Correct me if I'm wrong.) Is there merit in implementing this new
fuzzy algorithm? That may depend on how well or poorly soundex and
metaphone do the job. Can anyone else suggest a better approach?
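
To give a rough idea of what the core of such an "accents" algorithm
might look like, here is a sketch that folds ISO-8859-1 accented letters
to their unaccented ASCII counterparts. Everything here is an assumption
for illustration: the names fold_latin1 and fold_word are hypothetical,
nothing like this exists in htfuzzy today, and the table treats every
accent as insignificant, which suits French but would be wrong for
Swedish. A real implementation would need a table per character set and
language.

```c
#include <stddef.h>

/* Hypothetical helper: map one ISO-8859-1 byte to its unaccented
 * ASCII letter. Bytes with no obvious single-letter equivalent
 * (marked '.' in the table, e.g. the AE ligature or sharp s) and
 * all bytes below 0xC0 are returned unchanged. */
static unsigned char fold_latin1(unsigned char c)
{
    /* One entry per byte from 0xC0 to 0xFF, 8 bytes per group. */
    static const char table[] =
        "AAAAAA.C" "EEEEIIII"   /* 0xC0 - 0xCF */
        ".NOOOOO." "OUUUUY.."   /* 0xD0 - 0xDF */
        "aaaaaa.c" "eeeeiiii"   /* 0xE0 - 0xEF */
        ".nooooo." "ouuuuy.y";  /* 0xF0 - 0xFF */

    if (c >= 0xC0) {
        char f = table[c - 0xC0];
        if (f != '.')
            return (unsigned char)f;
    }
    return c;
}

/* Hypothetical helper: fold a NUL-terminated word in place. */
static void fold_word(char *word)
{
    unsigned char *p;

    for (p = (unsigned char *)word; *p; p++)
        *p = fold_latin1(*p);
}
```

The fuzzy database would then map each folded form to all the indexed
words that fold to it, so that a search for either spelling finds both.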

> >>* I patched Display.cc for a presentation glitch (in my view) : the 1 2 3 4
> >This patch will mess up the displayed search results on non-graphical
> >browsers like Lynx.
>
> I do pay attention to the lynx display (mostly for vision-deficient
> customers). And this patch does mess (as far as I have noticed) with the
> lynx browser. See for yourself at http://www.monde-diplomatique.fr/

A non-graphical browser-friendly approach to this problem would be to put
spaces in the alt text in the <img> tags.

I think it might make sense to add a page_number_separator attribute,
which would be a space by default, to handle situations like this more
elegantly. Thoughts?

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930