[htdig] [patch] accents fuzzy match algorithm for 3.1.5


Subject: [htdig] [patch] accents fuzzy match algorithm for 3.1.5
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Mar 10 2000 - 11:39:44 PST


This is the latest update to Robert Marchand's accents patch, which merges
together the last three patches I posted. It includes the fix to htsearch
for parsing search_algorithm correctly in locales that use a comma as
a decimal point, as well as the kludge to support truncated words.
(If you've already applied the 3 previous patches, you don't need to use
this one. I'm posting it for the convenience of those who wish to try
this out, but may have been confused by the flurry of e-mails about the
earlier patches.)

This patch is for 3.1.5. You should be able to apply it with
"patch -p1 < this_file" while in the main source directory. I made
two changes to Robert's code. First, when using the characters as
subscripts into the MinusculeISOLAT1 array, it's necessary to cast them
to unsigned char, or this will break on systems where chars are signed
by default. I also made a kludgy fix to support truncating search words
to maximum_word_length, to properly match similarly truncated words in
the database. I'm not wild about the external reference to config,
while other methods in this class have the config object passed to
them, but it should get the job done. I'd still recommend increasing
maximum_word_length to avoid this problem altogether.

When you change your locale to one that uses a different format for
floating point numbers (i.e 0,5 instead of 0.5), then you must change
any floating point attribute definitions in your config file to use this
floating point format. This can affect any of the *_factor attributes, as
well as the search_algorithm attribute, on any system in which the atof()
function is locale-aware, as is the case on Linux systems where atof()
simply calls strtod(). Without this change, the floating point numbers
will be read as integers, so 0.5 will be treated as 0. If htsearch
thinks the weight is 0 for any fuzzy match algorithm, it won't highlight
the search words in the excerpt, even though, oddly enough, it did seem
to find those words. (I guess it would affect the ranking, though.)
This patch also fixes htsearch not to use commas as string list separators
in parsing search_algorithm, so the comma can be used in numbers.

This supports accents only in the ISO-8859-1 (Latin 1) character set.

diff -c3prN htdig-3.1.5/htcommon/defaults.cc htdig-3.1.5.accents/htcommon/defaults.cc
*** htdig-3.1.5/htcommon/defaults.cc Thu Feb 24 20:29:10 2000
--- htdig-3.1.5.accents/htcommon/defaults.cc Thu Mar 2 11:20:55 2000
*************** ConfigDefaults defaults[] =
*** 27,32 ****
--- 27,33 ----
      //
      // General defaults
      //
+ {"accents_db", "${database_base}.accents.db"},
      {"add_anchors_to_excerpt", "true"},
      {"allow_in_form", ""},
      {"allow_numbers", "false"},
diff -c3prN htdig-3.1.5/htfuzzy/Accents.cc htdig-3.1.5.accents/htfuzzy/Accents.cc
*** htdig-3.1.5/htfuzzy/Accents.cc Wed Dec 31 18:00:00 1969
--- htdig-3.1.5.accents/htfuzzy/Accents.cc Thu Mar 2 16:33:10 2000
***************
*** 0 ****
--- 1,179 ----
+ //
+ // Accents.cc
+ //
+ // Implementation of Accents
+ //
+ //
+ //
+ #if RELEASE
+ static char RCSid[] = "$Id: $";
+ #endif
+
+ #include "Configuration.h"
+ #include "htconfig.h"
+ #include "Accents.h"
+ #include "Dictionary.h"
+ #include <ctype.h>
+ #include <fstream.h>
+
+ extern int debug;
+
+ /*---------------------------------------------------------------.
+ | Ajoute par Robert Marchand pour permettre le traitement adequat de |
+ | l'ISO-LATIN (provient du code de Pierre Rosa) |
+ `---------------------------------------------------------------*/
+
+ /*--------------------------------------------------.
+ | table iso-latin1 "minusculisee" et "de-accentuee" |
+ `--------------------------------------------------*/
+
+ static char MinusculeISOLAT1[256] = {
+ 0, 1, 2, 3, 4, 5, 6, 7,
+ 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23,
+ 24, 25, 26, 27, 28, 29, 30, 31,
+ 32, 33, 34, 35, 36, 37, 38, 39,
+ 40, 41, 42, 43, 44, 45, 46, 47,
+ 48, 49, 50, 51, 52, 53, 54, 55,
+ 56, 57, 58, 59, 60, 61, 62, 63,
+ 64, 'a', 'b', 'c', 'd', 'e', 'f', 'g',
+ 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
+ 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
+ 'x', 'y', 'z', 91, 92, 93, 94, 95,
+ 96, 'a', 'b', 'c', 'd', 'e', 'f', 'g',
+ 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
+ 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
+ 'x', 'y', 'z', 123, 124, 125, 126, 127,
+ 128, 129, 130, 131, 132, 133, 134, 135,
+ 136, 137, 138, 139, 140, 141, 142, 143,
+ 144, 145, 146, 147, 148, 149, 150, 151,
+ 152, 153, 154, 155, 156, 157, 158, 159,
+ 160, 161, 162, 163, 164, 165, 166, 167,
+ 168, 168, 170, 171, 172, 173, 174, 175,
+ 176, 177, 178, 179, 180, 181, 182, 183,
+ 184, 185, 186, 187, 188, 189, 190, 191,
+ 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'c',
+ 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i',
+ 208, 'n', 'o', 'o', 'o', 'o', 'o', 'o',
+ 'o', 'u', 'u', 'u', 'u', 'y', 222, 223,
+ 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'c',
+ 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i',
+ 240, 'n', 'o', 'o', 'o', 'o', 'o', 'o',
+ 'o', 'u', 'u', 'u', 'u', 'y', 254, 255};
+
+
+ //*****************************************************************************
+ // Accents::Accents()
+ //
+ Accents::Accents()
+ {
+ name = "accents";
+ }
+
+
+ //*****************************************************************************
+ // Accents::~Accents()
+ //
+ Accents::~Accents()
+ {
+ }
+
+ //*****************************************************************************
+ // int Accents::writeDB(Configuration &config)
+ //
+ int
+ Accents::writeDB(Configuration &config)
+ {
+ String var = name;
+ var << "_db";
+ String filename = config[var];
+
+ index = Database::getDatabaseInstance();
+ if (index->OpenReadWrite(filename, 0664) == NOTOK)
+ return NOTOK;
+
+ String *s;
+ char *fuzzyKey;
+
+ int count = 0;
+
+ dict->Start_Get();
+ while ((fuzzyKey = dict->Get_Next()))
+ {
+ s = (String *) dict->Find(fuzzyKey);
+
+ // Only add if meaningfull list
+ if (mystrcasecmp(fuzzyKey, s->get()) != 0) {
+
+ index->Put(fuzzyKey, *s);
+
+ if (debug > 1)
+ {
+ cout << "htfuzzy: '" << fuzzyKey << "' ==> '" << s->get() << "'\n"
+ ;
+ }
+ count++;
+ if ((count % 100) == 0 && debug == 1)
+ {
+ cout << "htfuzzy: keys: " << count << '\n';
+ cout.flush();
+ }
+ }
+ }
+ if (debug == 1)
+ {
+ cout << "htfuzzy:Total keys: " << count << "\n";
+ }
+ return OK;
+ }
+
+
+ //*****************************************************************************
+ // void Accents::generateKey(char *word, String &key)
+ //
+ void
+ Accents::generateKey(char *word, String &key)
+ {
+ extern Configuration config;
+ static int maximum_word_length = config.Value("maximum_word_length", 12);
+
+ if (!word || !*word)
+ return;
+
+ String temp(word);
+ if (temp.length() > maximum_word_length)
+ temp.chop(temp.length()-maximum_word_length);
+ word = temp.get();
+ key = '0';
+ while (*word) {
+ key << MinusculeISOLAT1[ (unsigned char) *word++ ];
+ }
+ }
+
+
+ //*****************************************************************************
+ // void Accents::addWord(char *word)
+ //
+ void
+ Accents::addWord(char *word)
+ {
+ if (!dict)
+ {
+ dict = new Dictionary;
+ }
+
+ String key;
+ generateKey(word, key);
+
+ String *s = (String *) dict->Find(key);
+ if (s)
+ {
+ // if (mystrcasestr(s->get(), word) != 0)
+ (*s) << ' ' << word;
+ }
+ else
+ {
+ dict->Add(key, new String(word));
+ }
+ }
+
diff -c3prN htdig-3.1.5/htfuzzy/Accents.h htdig-3.1.5.accents/htfuzzy/Accents.h
*** htdig-3.1.5/htfuzzy/Accents.h Wed Dec 31 18:00:00 1969
--- htdig-3.1.5.accents/htfuzzy/Accents.h Thu Mar 2 11:24:56 2000
***************
*** 0 ****
--- 1,30 ----
+ //
+ // Accents.h
+ //
+ // $Id: $
+ //
+ //
+ #ifndef _Accents_h_
+ #define _Accents_h_
+
+ #include "Fuzzy.h"
+
+ class Accents : public Fuzzy
+ {
+ public:
+ //
+ // Construction/Destruction
+ //
+ Accents();
+ virtual ~Accents();
+
+ virtual int writeDB(Configuration &config);
+
+ virtual void generateKey(char *word, String &key);
+
+ virtual void addWord(char *word);
+
+ private:
+ };
+
+ #endif
diff -c3prN htdig-3.1.5/htfuzzy/Fuzzy.cc htdig-3.1.5.accents/htfuzzy/Fuzzy.cc
*** htdig-3.1.5/htfuzzy/Fuzzy.cc Thu Feb 24 20:29:10 2000
--- htdig-3.1.5.accents/htfuzzy/Fuzzy.cc Thu Mar 2 11:22:14 2000
*************** static char RCSid[] = "$Id: Fuzzy.cc,v 1
*** 13,18 ****
--- 13,19 ----
  #include "Configuration.h"
  #include "List.h"
  #include "StringList.h"
+ #include "Accents.h"
  #include "Endings.h"
  #include "Exact.h"
  #include "Metaphone.h"
*************** Fuzzy::getFuzzyByName(char *name)
*** 171,176 ****
--- 172,179 ----
          return new Soundex();
      else if (mystrcasecmp(name, "metaphone") == 0)
          return new Metaphone();
+ else if (mystrcasecmp(name, "accents") == 0)
+ return new Accents();
      else if (mystrcasecmp(name, "endings") == 0)
          return new Endings();
      else if (mystrcasecmp(name, "synonyms") == 0)
diff -c3prN htdig-3.1.5/htfuzzy/Makefile.in htdig-3.1.5.accents/htfuzzy/Makefile.in
*** htdig-3.1.5/htfuzzy/Makefile.in Thu Feb 24 20:29:10 2000
--- htdig-3.1.5.accents/htfuzzy/Makefile.in Thu Mar 2 11:23:48 2000
*************** include $(top_builddir)/Makefile.config
*** 10,20 ****
  OBJS= Endings.o EndingsDB.o Exact.o \
                  Fuzzy.o Metaphone.o Soundex.o \
                  SuffixEntry.o Synonym.o htfuzzy.o \
! Substring.o Prefix.o
  
  LIBOBJS= Endings.o Exact.o Fuzzy.o Metaphone.o \
                  Soundex.o Synonym.o EndingsDB.o SuffixEntry.o \
! Substring.o Prefix.o
  
  TARGET= htfuzzy
  LIBTARGET= libfuzzy.a
--- 10,20 ----
  OBJS= Endings.o EndingsDB.o Exact.o \
                  Fuzzy.o Metaphone.o Soundex.o \
                  SuffixEntry.o Synonym.o htfuzzy.o \
! Substring.o Prefix.o Accents.o
  
  LIBOBJS= Endings.o Exact.o Fuzzy.o Metaphone.o \
                  Soundex.o Synonym.o EndingsDB.o SuffixEntry.o \
! Substring.o Prefix.o Accents.o
  
  TARGET= htfuzzy
  LIBTARGET= libfuzzy.a
diff -c3prN htdig-3.1.5/htfuzzy/htfuzzy.cc htdig-3.1.5.accents/htfuzzy/htfuzzy.cc
*** htdig-3.1.5/htfuzzy/htfuzzy.cc Thu Feb 24 20:29:11 2000
--- htdig-3.1.5.accents/htfuzzy/htfuzzy.cc Thu Mar 2 11:23:12 2000
*************** static char RCSid[] = "$Id: htfuzzy.cc,v
*** 43,48 ****
--- 43,49 ----
  
  #include "htfuzzy.h"
  #include "Fuzzy.h"
+ #include "Accents.h"
  #include "Soundex.h"
  #include "Endings.h"
  #include "Metaphone.h"
*************** main(int ac, char **av)
*** 108,113 ****
--- 109,118 ----
          {
              wordAlgorithms.Add(new Metaphone);
          }
+ else if (mystrcasecmp(av[i], "accents") == 0)
+ {
+ wordAlgorithms.Add(new Accents);
+ }
          else if (mystrcasecmp(av[i], "endings") == 0)
          {
              noWordAlgorithms.Add(new Endings);
*************** usage()
*** 237,242 ****
--- 242,248 ----
      cout << "Supported algorithms:\n";
      cout << "\tsoundex\n";
      cout << "\tmetaphone\n";
+ cout << "\taccents\n";
      cout << "\tendings\n";
      cout << "\tsynonyms\n";
      cout << "\n";
diff -c3prN htdig-3.1.5/htsearch/htsearch.cc htdig-3.1.5.accents/htsearch/htsearch.cc
*** htdig-3.1.5/htsearch/htsearch.cc Thu Feb 24 20:29:11 2000
--- htdig-3.1.5.accents/htsearch/htsearch.cc Mon Mar 6 13:13:00 2000
*************** setupWords(char *allWords, List &searchW
*** 475,481 ****
      // configuration attribute.
      // For algorithms other than exact, we need to also do word lookups.
      //
! StringList algs(config["search_algorithm"], " \t,");
      List algorithms;
      String name, weight;
      double fweight;
--- 475,481 ----
      // configuration attribute.
      // For algorithms other than exact, we need to also do word lookups.
      //
! StringList algs(config["search_algorithm"], " \t");
      List algorithms;
      String name, weight;
      double fweight;

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------ To unsubscribe from the htdig mailing list, send a message to htdig-unsubscribe@htdig.org You will receive a message to confirm this.



This archive was generated by hypermail 2b28 : Fri Mar 10 2000 - 11:44:48 PST