Subject: Re: [htdig] new to htdig
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Feb 23 2000 - 09:03:30 PST


According to Robert Marchand:
> we have finished a first phase of search engine evaluation for
> indexing our domain (umontreal.ca), and ht://Dig seems to be the
> winner.
>
> However, there are some issues that need to be resolved before we
> adopt it.
>
> 1) We badly need the 'fuzzy accent' algorithm, or whatever the solution
> would be, to be able to search a word with and without accents: e.g.,
> search "Montréal" or "Montreal" and get the same results. This is very
> important for us. I've looked at some discussion on this topic here and
> would like to know if it will be released soon. If not, then we will
> have to find a quick-and-dirty solution, like patching some files
> ourselves.
>
> I've looked a little at the code (I'm not a C++ expert) and I understand
> that it would need several patches to meet the following requirements:
>
> - search either "Montréal" or "Montreal" and get all the occurrences as if
> someone had typed "Montreal or Montréal".
>
> - highlight the word that was searched.
>
> I know the code does this for the lower/upper case mappings. Could the
> same be done for accents?

The upper to lower case mapping is a little different in that it's
handled pretty much the same way for all languages using an ISO-Latin x
encoding, and the locale, if properly defined, gives the mapping.
Locales don't give any information for mapping accented to unaccented
letters, so that information will have to be provided elsewhere - either
as hardcoded Latin 1 mappings in the code (which would limit its
usefulness), or as mappings configurable by the user.
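
Just to illustrate the asymmetry (this is only a standalone sketch, and
the locale name below is a guess that may not exist on your system):
once the locale is set, the C library will fold case for accented
letters, but it offers nothing comparable for stripping the accents.

    #include <clocale>
    #include <cctype>
    #include <cstdio>

    int main()
    {
        // Locale names vary by system; this is just a guess at a
        // Latin-1 French locale.  If it isn't installed, no mapping.
        setlocale(LC_CTYPE, "fr_FR.ISO8859-1");

        unsigned char upper = 0xC9;             /* 'É' in ISO-Latin-1 */
        unsigned char lower = tolower(upper);   /* 'é' (0xE9) if the
                                                   locale took effect */
        printf("%02x -> %02x\n", upper, lower);

        // There is no equivalent call that turns 'é' into 'e'; that
        // mapping has to come from a table we supply ourselves.
        return 0;
    }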

Also, as we've discussed previously in many threads, it would be much
better to implement the accent handling as a fuzzy match, rather than
like the case mapping. As you've realised yourself, patching the code
elsewhere would require many, many changes in many parts of the code,
even for a quick-and-dirty solution, so you'd probably end up doing more
work than it would take to write one new fuzzy match algorithm, with
less satisfactory results that would be less likely to be incorporated
into the distribution source.

Geoff suggests basing it on the Substring fuzzy algorithm, but the more
I think about it, the more I think it should be done like soundex and
metaphone. Consider the similarities: soundex and metaphone look at all
the indexed words, and build a database of these words in a canonical
form that represents how the word sounds, mapping a given word "sound"
to all the forms and spellings in the indexed documents that yield that
sound. The accents algorithm is the same idea, except instead of going
by sounds, the canonical form is the unaccented equivalent of the word.
Most of the fuzzy match infrastructure is already in place, so by adding
an algorithm, you automatically get the search for all alternate forms
of a word, in both the database and the excerpt highlighting. Plus the
feature can be selected at run time via the search_algorithms attribute.

The actual canonicalisation of words in that accents algorithm could be
even simpler than the soundex or metaphone algorithms. You could do it
with a lookup table that gives one-to-one character mappings. However,
to provide the added flexibility of one-to-many, many-to-one and
many-to-many mappings as well, as we had discussed in a couple earlier
threads, would make this algorithm even more useful, without adding
much complexity. The HtWordCodec code that Hans-Peter developed could
do all of these string mappings for us. I envision using it very much
like how the url_part_aliases mappings are handled. When you look at
it this way, implementing this really involves gluing together existing
pieces, so it shouldn't be that hard, and then configuring it with a
new config attribute that contains all the accent to plain letter mappings,
as string pairs in a string list.
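
As a rough idea of what that could look like (the attribute name and
format below are made up, just patterned after url_part_aliases): each
pair maps an accented form to its plain equivalent, and one-to-many
mappings fit the same scheme.

    # Hypothetical attribute; pairs of "accented form, plain form".
    accent_mappings:  é e  è e  ê e  ë e  à a  â a  ô o  ù u  û u  ç c  æ ae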

There have been a few kludgy approaches to mapping accents before, but
none did a good job of it (one of them involved blindly zapping data in
one of the .db files, regardless of whether it was text or not), and so
none were good enough to include in the distributions. I really don't
think it would be much more effort to implement a proper fuzzy algorithm,
and the end result would be so much better.

> 2) We have a problem with robots.txt and the database. It seems that if
> the file robots.txt is modified or added after a complete reindex from
> scratch and BEFORE an update reindex, some files that are no longer
> accepted are kept in the database. Does that mean a complete
> reindex has to be done after a change in robots.txt? That seems a bit
> harsh. We have no control over all the sites to index.
>
> Am I wrong? Is this a bug?

Hmm. Changing robots.txt after a spider crawled your site is sort of like
closing the barn door after the horses have left. Still, I can see what
you're getting at. Right now, htdig has fairly limited means of purging
documents from a database, and this is a frequent complaint. I can see the
usefulness of being able to check all existing URLs in the database against
their server's robots.txt (as well as exclude_urls and limit_urls_to), and
toss out URLs that are now disallowed. It would be a nice feature to
add, but perhaps it should be optional, as some users may not like that
behaviour.
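
Just to sketch what such a purge pass might do (standalone code, not
htdig's actual internals), it would simply re-apply the current rules
to every URL already in the database:

    #include <string>
    #include <vector>
    #include <iostream>

    // Re-check URLs already in the database against the server's
    // current robots.txt Disallow prefixes (exclude_urls and
    // limit_urls_to could be re-applied the same way).
    static bool disallowed(const std::string &path,
                           const std::vector<std::string> &disallow)
    {
        for (const std::string &prefix : disallow)
            if (path.compare(0, prefix.size(), prefix) == 0)
                return true;
        return false;
    }

    int main()
    {
        std::vector<std::string> disallow = { "/private/", "/tmp/" };
        std::vector<std::string> indexed  = { "/index.html",
                                              "/private/report.html" };
        for (const std::string &url : indexed)
            if (disallowed(url, disallow))
                std::cout << "purge " << url << "\n"; // drop from the db
    }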

> I'd appreciate any responses!
> We're using release version 3.1.4.

If you do go ahead and make modifications (especially if you decide to
write the accents fuzzy algorithm), you may want to look at the 3.2.0b1
beta, or wait a little longer for the 3.2.0b2 beta. If that code works
for you, it would be preferable to add your changes to it rather than
3.1.x, which is really in maintenance mode right now. However, it's
your choice, as porting a fuzzy algorithm from 3.1.x to 3.2, or back,
should be pretty easy to do.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



