Re: [htdig] Re: accents mapping


Subject: Re: [htdig] Re: accents mapping
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Feb 25 2000 - 14:05:56 PST


According to Robert Marchand:
> what are the steps to create a new fuzzy algorithm?
> I mean, apart from creating a new class, what needs to be changed
> in order to register a name to be used in the configuration files?

Well, the 3 examples I'd suggest are endings or synonyms, for
static (dictionary-based) databases, soundex or metaphone for dynamic
(document-based) databases, and substring for run-time (no extra database)
methods.

Let's assume you use soundex as the example to follow. Look for soundex
in the source (grep -i soundex */*.{h,cc}). Of course, that will turn
up in the Soundex class, Soundex.{h,cc}, so you can create new files for
your new class, using these as a starting point. The other places are:
Fuzzy.cc and htfuzzy.cc, both in the htfuzzy directory.

In Fuzzy.cc, you want to add an include for the header file
for the new class, as well as the check for the new algorithm in
Fuzzy::getFuzzyByName(), which will be used by htsearch. In htfuzzy.cc,
you want to also add an include for the new header file, as well as
the check for the new algorithm in main(), and the extra description
in usage(), but you only add your new algorithm to htfuzzy.cc if it
requires its own database to be built. You'll note there's no reference
to substring or prefix in htfuzzy.cc.
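
As a rough illustration of the kind of name check that goes in
Fuzzy::getFuzzyByName(), here is a minimal, self-contained sketch. It is
not the actual ht://Dig source: the classes below are stand-ins, "Accents"
is a hypothetical new algorithm, and the signature, the smart pointers and
the case-insensitive comparison are assumptions made for the example.

```cpp
#include <cctype>
#include <memory>
#include <string>

// Stand-in for the real Fuzzy base class; only what the sketch needs.
class Fuzzy {
public:
    virtual ~Fuzzy() = default;
    virtual std::string name() const = 0;
};

// Stand-in for an existing algorithm.
class Soundex : public Fuzzy {
public:
    std::string name() const override { return "soundex"; }
};

// Hypothetical new algorithm, modelled on the class above.
class Accents : public Fuzzy {
public:
    std::string name() const override { return "accents"; }
};

// Case-insensitive name comparison (an assumption of this sketch).
static bool sameName(const std::string &a, const std::string &b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i])))
            return false;
    }
    return true;
}

// Maps a name from the configuration to an algorithm object.
// Returns a null pointer for names it does not know.
std::unique_ptr<Fuzzy> getFuzzyByName(const std::string &name) {
    if (sameName(name, "soundex"))
        return std::unique_ptr<Fuzzy>(new Soundex);
    if (sameName(name, "accents"))  // the check added for the new class
        return std::unique_ptr<Fuzzy>(new Accents);
    return nullptr;
}
```

The point is only that registering a name comes down to one extra
comparison that returns an instance of the new class; the real function
should be extended in whatever style it already uses.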

Any config attributes used should be defined in htcommon/defaults.cc.
For 3.2.x, the documentation for these attributes should be included
in the defaults.cc entries, while for 3.1.x, they should be added
to htdoc/attrs.html, and referenced in htdoc/cf_byname.html and
htdoc/cf_byprog.html. Don't forget to update the documentation
entry for search_algorithm too. Finally, you should also add notes in
htdoc/htfuzzy.html (for additions to htfuzzy.cc), and htdoc/require.html,
as well as installdir/htdig.conf.

Of course, to build the new class, you'll need entries in
htfuzzy/Makefile. Note that this file is automatically built by
./configure, so you should modify htfuzzy/Makefile.in instead, and rerun
configure, or modify both manually. You need to add the object file name
for your new class to LIBOBJS, and to OBJS too if used in htfuzzy.cc.
In 3.2.x, even Makefile.in is automatically generated by automake,
from Makefile.am, so that's where your changes should go for 3.2.x.
There, you must add the name of your new header file to noinst_HEADERS,
and the new .cc source file to libfuzzy_la_SOURCES.

> does the main htsearch also need to be changed?

No, htsearch itself shouldn't need changes. It picks up the new algorithm
through the changes to htfuzzy/Fuzzy.cc and htfuzzy/Makefile.in (or .am).

> Is there documentation for this process?

I just wrote it. :-) No, I couldn't find anything else.

> for the record the modifications I've done in WordList and parser seem
> to work and it was pretty easy but there are problems and I want to
> take a look at the 'fuzzy way' which is certainly more elegant.

One big advantage of the fuzzy algorithms is they can be applied after
the fact, and enabled or disabled at will. By treating accent removal
like case mapping, it means all words are indexed in their stripped form,
so you can't later request an exact match. This is currently the way
things work for upper- vs. lower-case spellings, but I think in practice
that's less of a problem.
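
To make the contrast concrete, here is a small sketch of how an accents
algorithm could work as a dynamic (document-based) database, in the same
spirit as soundex: the index keeps the words as spelled, and a separate
table maps each accent-stripped key to the spellings actually found.
Everything here is an assumption for illustration: the names accentKey,
buildAccentDb and expand are made up, the input is taken to be ISO-8859-1,
and only the more common accented letters are folded.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Fold one ISO-8859-1 character to a lower-case, unaccented letter.
// Characters not covered here are passed through unchanged.
static char foldLatin1(unsigned char c) {
    if ((c >= 0xC0 && c <= 0xC5) || (c >= 0xE0 && c <= 0xE5)) return 'a';
    if (c == 0xC7 || c == 0xE7) return 'c';
    if ((c >= 0xC8 && c <= 0xCB) || (c >= 0xE8 && c <= 0xEB)) return 'e';
    if ((c >= 0xCC && c <= 0xCF) || (c >= 0xEC && c <= 0xEF)) return 'i';
    if (c == 0xD1 || c == 0xF1) return 'n';
    if ((c >= 0xD2 && c <= 0xD6) || (c >= 0xF2 && c <= 0xF6)) return 'o';
    if ((c >= 0xD9 && c <= 0xDC) || (c >= 0xF9 && c <= 0xFC)) return 'u';
    if (c < 0x80) return static_cast<char>(std::tolower(c));
    return static_cast<char>(c);
}

// The key under which all accent variants of a word are grouped.
std::string accentKey(const std::string &word) {
    std::string key;
    for (unsigned char c : word) key += foldLatin1(c);
    return key;
}

// Stripped key -> spellings that actually occur in the documents.
using AccentDb = std::map<std::string, std::set<std::string>>;

// Built once from the indexed words, which stay in the index unmodified.
AccentDb buildAccentDb(const std::vector<std::string> &indexedWords) {
    AccentDb db;
    for (const auto &w : indexedWords) db[accentKey(w)].insert(w);
    return db;
}

// At search time: every indexed spelling sharing the query's key.
std::set<std::string> expand(const AccentDb &db, const std::string &query) {
    auto it = db.find(accentKey(query));
    if (it == db.end()) return {};
    return it->second;
}
```

Because the index itself is untouched, leaving the algorithm out of the
search still gives an exact match, which is what is lost when the
stripping is done at indexing time.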

> One problem I've seen with my approach is that the endings database is
> untouched so a search for "Université" is expanded to (université or
> universités) while a search for "Universite" is not. This was the case
> before the patch but it is more apparent now. I'm not sure the fuzzy
> algorithm would cure it unless fuzzy algorithms are applied to each
> other.

That's a very good point. It would appear that fuzzy algorithms are not
applied cumulatively. I tried a search for "acknowledgement", and it
came back with Search results for '(acknowledgement or acknowledgment
or acknowledgements)'. Note the absence of the word acknowledgments,
which would have required an application of both the synonyms and
the endings fuzzy match, with endings following synonyms (because the
synonyms database doesn't have any plurals). Also, if you start with
a plural form, synonyms expansion doesn't work, so really you'd need to
apply all algorithms to each other.

The accents fuzzy algorithm would add an interesting wrinkle to this, in
that you may begin with words that are misspelled, and therefore not in
the endings dictionary, so you'd want accents applied first, to pick up
the spellings actually used, then apply endings to get all the variations
of the word, then you'd probably want to run these through accents
again, in case it would pick up other misspelled entries in documents.
Maybe there should be an option to get the fuzzy algorithms to be both
cumulative and iterative, so it would continue applying and reapplying
algorithms to the growing list of search words until nothing new turns
up, or until all words have had all algorithms applied to them, if we
keep track of them. I guess the weight of a word would be the product
of the weights of all algorithms applied to get the word.
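
Here is a minimal sketch of what such a cumulative, iterative expansion
could look like. It is only an illustration of the idea, not existing
htsearch code: the Algorithm struct and the expandAll function are
invented for the example, and each algorithm is reduced to a weight plus
a function returning the variants of one word.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <vector>

// One fuzzy algorithm, reduced to its weight and a function that
// returns the variants of a single word (hypothetical interface).
struct Algorithm {
    double weight;
    std::function<std::vector<std::string>(const std::string &)> variants;
};

// Apply every algorithm to every word, including the words produced
// along the way, until nothing new turns up. A derived word gets the
// weight of the word it came from times the weight of the algorithm
// that produced it, so its weight is the product along its derivation.
std::map<std::string, double> expandAll(
        const std::string &start,
        const std::vector<Algorithm> &algorithms) {
    std::map<std::string, double> weights;
    weights[start] = 1.0;
    std::deque<std::string> pending;
    pending.push_back(start);

    while (!pending.empty()) {
        const std::string word = pending.front();
        pending.pop_front();
        const double base = weights[word];
        for (const auto &alg : algorithms) {
            for (const auto &v : alg.variants(word)) {
                if (weights.count(v)) continue;  // already in the list
                weights[v] = base * alg.weight;
                pending.push_back(v);  // gets all algorithms in turn
            }
        }
    }
    return weights;
}
```

Each word is queued only once, so the loop stops as soon as the
algorithms produce no new words. The sketch keeps the weight of the first
derivation it finds for a word; keeping the highest weight over all
derivations would be another reasonable choice.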

> P.S.: I have my patches available should anybody want to look at them.

Feel free to post them to the list. I'd be curious to see them.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



