Subject: Re: [htdig] Precise Fuzziness
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Nov 30 1999 - 08:51:13 PST


According to Dave Melton:
> As I mentioned, there are two main systems for transcribing
> Japanese words into English characters. One system is much more
> common than the other, and produces spellings that make (I think)
> a little more sense to the native English speaker. I've just
> received 600+ HTML files that use the other system. I don't want
> to take the time to change all of these files, and the person who
> sent them to me doesn't want them changed. On the other hand,
> most of the site's users are far more used to the more common
> spellings of person and place names.
>
> It would be ideal if I could provide a short list of acceptable
> alternate spellings...my "precise fuzziness". The number of
> required substitutions is actually pretty small...the following
> should cover it:
>
> For "o", accept "o", "oo", or "o'o"
> For "u", accept "u", "uu", or "u'u"
> For "n", accept "n" or "n'"
> For "zu", accept "zu", "tsu", or "dzu"
>
> All of this could, of course, be accomplished by substituting
> some regular expression logic into the search string.
>
> One common example is the spelling of Japan's largest city. A
> user would want to search for "Tokyo", but would need it to
> match "Tokyoo" in the alternate spelling HTML files.
>
> I'd love to find a simple way to do this. I haven't looked
> into the sources at all...I'd rather not go that way if I don't
> have to. On the other hand, if it's possible to build a "custom
> fuzzy", that might be an option.

I see this as similar to, or an extension of, the proposed "accent" fuzzy
algorithm. In the accent algorithm, after
digging and merging, you'd go through the words, as is done with soundex
and metaphone, and build a new database of words with the accents stripped
off. This would be done according to pre-defined or user-defined mappings
of accented to unaccented letters, to put the words into a canonical form.
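
A minimal sketch of that index-time step (the table and the names
canonical_form() and accent_db below are made up for illustration; they're
not the actual fuzzy-algorithm interface):

#include <map>
#include <string>
#include <vector>

// Pre-defined (or user-defined) mappings of accented to unaccented letters.
// Latin-1 bytes written as hex escapes.
static const std::map<std::string, std::string> accent_map = {
    { "\xe9", "e" },   // é -> e
    { "\xe8", "e" },   // è -> e
    { "\xf6", "o" },   // ö -> o
    { "\xfc", "u" },   // ü -> u
};

// Reduce a word to its canonical (unaccented) form.
std::string canonical_form(const std::string &word)
{
    std::string out;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        std::string c(1, word[i]);
        auto it = accent_map.find(c);
        out += (it != accent_map.end()) ? it->second : c;
    }
    return out;
}

// Built after digging and merging, as for soundex/metaphone:
// canonical form -> every word in the database that reduces to it.
std::map<std::string, std::vector<std::string> > accent_db;

void add_word(const std::string &word)
{
    accent_db[canonical_form(word)].push_back(word);
}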

In htsearch, the search words would be canonicalised using the same rules,
to allow database lookups of the canonical forms. When we discussed
this previously, it was proposed that the canonical search of the word
database be done on the fly, as for prefix and substring algorithms,
but after giving this more thought, I think using a database, as the
metaphone and soundex algorithms do, makes more sense and would be quicker.
Unlike prefix and substring, where you can't predetermine the set of
substrings that will be searched, with accented words there is only one
canonical form, just like for metaphone and soundex.
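
The htsearch side then becomes a single lookup; continuing the sketch above
(reusing canonical_form() and accent_db from it):

// Canonicalise the query word with the same rules and do one lookup,
// rather than scanning the whole word list as prefix/substring must.
std::vector<std::string> accent_lookup(const std::string &query)
{
    auto it = accent_db.find(canonical_form(query));
    if (it == accent_db.end())
        return std::vector<std::string>();   // nothing reduces to this form
    return it->second;   // e.g. "resume" could return "resume" and "résumé"
}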

The idea of user-defined mappings is to allow customisation for different
languages, and ideally this would allow one-to-many and many-to-one
mappings as well. E.g., you might want to map both "oe" and "ö" to
"o" for German and Scandinavian. If the algorithm were designed to be
flexible enough, it could handle your definitions for canonicalising
Romaji as well.
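
For the one-to-many and many-to-one cases, the mapping table could hold
strings rather than single letters and be applied longest match first,
something along these lines (canonicalise() and the rule list are again
only an illustration; the syntax of a user-defined mapping attribute would
still have to be worked out):

#include <map>
#include <string>

// Apply string-to-string rules, longest key first at each position, so
// multi-letter spellings like "oo", "o'o" or "tsu" collapse to one form.
std::string canonicalise(const std::string &word,
                         const std::map<std::string, std::string> &rules,
                         std::string::size_type longest)
{
    std::string out;
    for (std::string::size_type i = 0; i < word.size(); ) {
        bool matched = false;
        for (std::string::size_type len = longest; len >= 1 && !matched; --len) {
            if (i + len > word.size())
                continue;
            auto it = rules.find(word.substr(i, len));
            if (it != rules.end()) {
                out += it->second;   // substitute the canonical spelling
                i += len;
                matched = true;
            }
        }
        if (!matched)
            out += word[i++];        // no rule applies, copy the letter
    }
    return out;
}

// Example rules covering both cases:
//   { {"oe","o"}, {"\xf6","o"},                           (German/Scandinavian)
//     {"oo","o"}, {"o'o","o"}, {"uu","u"}, {"u'u","u"},
//     {"n'","n"}, {"tsu","zu"}, {"dzu","zu"} }             (Romaji)
// With these, canonicalise("tokyoo", rules, 3) and canonicalise("tokyo",
// rules, 3) both come out as "tokyo", so a search for either spelling
// finds documents that use the other.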

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
