Status and Projects (long, not critical)


Geoff Hutchison (Geoffrey.R.Hutchison@williams.edu)
Mon, 20 Jul 1998 06:06:18 -0400 (EDT)


Andrew,

I had a few brainstorms this weekend, in part due to some research on new
search techniques. I'll fill you in on the status of htdig3 and some
projects.

1) Contributed patches: I'm steadily reviewing the patches I have from the
list. I hope to get the clearly correct patches applied this week, and then
I'll move on to the more questionable ones.

2) Bugs: I discovered a few bugs:
    a) /bin/sort difficulty on Linux
    b) HTML tags are sometimes incorrectly converted to spaces
    c) HTML.cc: doindex=0 doesn't remove doc from DB, it just has an empty
       excerpt. (Perhaps the easiest fix is to remove these in htmerge.)
    d) META description (my patch) doesn't work completely correctly. In
       reality a new field should be created and used by htsearch.

3) Compressed HTML: It's not easy to add this to Apache. I submitted a
feature request to GNATS.

4) New Fuzzy searches: Besides the request for a regex fuzzy, I plan on
adding two:
    a) speling: (sic) Similar to the tcsh, bash and Apache
spelling-correction. I have some code, but I'll also look at ispell.
    b) trigram: (see <http://www.heise.de/ct/english/9704386/>) A trigram
is a three character substring. By creating a trigram db from the word db,
a search term can be statistically compared to the trigram db to see what
words are similar. The c't article has details and sample code.
   Additionally, with a trigram DB, substring searches can be limited to a
subset of the full word DB (since the word must have all the trigrams of
the search substring to match). This should provide a significant speed
boost. I'm also wondering about the agrep code and the "academic citation"
copyright on it. (i.e. would it be useful? can we use it in a GPL?)
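
A rough sketch of the trigram idea in Python, just to make it concrete. This
is not htdig code and not the c't sample code: the function names, the Dice
coefficient used for the similarity score, and the 0.3 threshold are all
assumptions made for illustration.

```python
from collections import defaultdict

def trigrams(word):
    """Return the set of three-character substrings of word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_trigram_index(words):
    """Map each trigram to the set of words that contain it."""
    index = defaultdict(set)
    for w in words:
        for t in trigrams(w):
            index[t].add(w)
    return index

def similar_words(term, index, threshold=0.3):
    """Rank indexed words by Dice coefficient over shared trigrams.

    The scoring formula and threshold are illustrative assumptions.
    """
    q = trigrams(term)
    shared = defaultdict(int)
    for t in q:
        for w in index.get(t, ()):
            shared[w] += 1
    scored = []
    for w, n in shared.items():
        score = 2 * n / (len(q) + len(trigrams(w)))
        if score >= threshold:
            scored.append((score, w))
    # Best score first, ties broken alphabetically.
    return [w for _, w in sorted(scored, key=lambda p: (-p[0], p[1]))]

def substring_candidates(sub, index, words):
    """Find words containing sub, scanning only words that have all its trigrams."""
    q = trigrams(sub)
    if not q:
        # Fewer than three characters: no trigram to filter on, scan everything.
        return sorted(w for w in words if sub in w)
    candidates = set.intersection(*(index.get(t, set()) for t in q))
    return sorted(w for w in candidates if sub in w)
```

For example, with an index built from a small word list, the lookup for
"speling" only touches words sharing at least one trigram with it, and the
substring search for "arch" only checks words that contain both "arc" and
"rch".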

5) Phrase searching: This has been requested several times, and I didn't
know how to do it with a reasonable time/space trade-off. But if
there's an n-gram (say 6 characters) database of the documents *including*
spaces, then we can match word boundaries. This would require some testing
for reliability and disk requirements. (I'm betting it will be approx. the
word DB size and fairly reliable.)
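
A minimal sketch of how the n-gram phrase lookup might work, in Python. The
names, the normalization step (lowercase, single spaces, padded with a space
on each side) and the in-memory index are assumptions for illustration, not
an existing htdig design.

```python
import re
from collections import defaultdict

N = 6  # n-gram length suggested above

def normalize(text):
    """Lowercase, keep alphanumeric words, join with single spaces, pad ends."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " " + " ".join(words) + " "

def ngrams(text, n=N):
    """Return the set of n-character substrings, spaces included."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(docs):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for g in ngrams(normalize(text)):
            index[g].add(doc_id)
    return index

def phrase_candidates(phrase, index):
    """Documents containing every n-gram of the padded phrase.

    A match here is only probable: the n-grams could come from different
    places in the document, and phrases shorter than N characters after
    padding yield no n-grams at all.
    """
    grams = ngrams(normalize(phrase))
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))
```

Because the spaces are part of the n-grams, "quick brown" does not match a
document that only contains "quick brownie" or "brown quick". How often the
candidates are false positives is exactly the reliability question that
would need testing.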

6) Word frequency: One feature SWISH has that I miss is elimination of
frequent words. But if the word frequency is factored into the score, then
this is unnecessary. So a search for "the foobar" would count matches for
"foobar" as more important than "the" since the latter is much more
frequent.
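
One way to factor frequency into the score is an inverse document frequency
weight, sketched below in Python. This is a standard technique swapped in
for illustration; the log(N/df) formula and all names here are assumptions,
not how htsearch computes scores.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    return df

def score(query, docs):
    """Score each document as the sum of log(N / df) over matched query words.

    A word present in every document contributes log(1) = 0, so very
    frequent words stop mattering without an explicit stop list.
    """
    df = document_frequencies(docs)
    n = len(docs)
    scores = {}
    for doc_id, words in docs.items():
        present = set(words)
        scores[doc_id] = sum(
            math.log(n / df[w]) for w in set(query) if w in present
        )
    return scores
```

With three documents that all contain "the" and only one that contains
"foobar", a query for "the foobar" gives the "foobar" document a positive
score and the others zero.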

7) Duplicate URLs: This is mainly worked out, but one thing to note is the
"ETag" header lines I see from Apache. I think they're checksums of some
sort, which could minimize the work htdig has to do.
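
In HTTP/1.1 the ETag is an opaque validator rather than a guaranteed
checksum, but a client can send a stored value back in an If-None-Match
header and get a 304 Not Modified response if the document is unchanged. A
small Python sketch of that exchange follows; the function names are
hypothetical and this is not how htdig fetches documents.

```python
import urllib.error
import urllib.request

def conditional_request(url, stored_etag=None):
    """Build a GET request that carries the stored ETag, if there is one."""
    req = urllib.request.Request(url)
    if stored_etag:
        req.add_header("If-None-Match", stored_etag)
    return req

def fetch_if_changed(url, stored_etag=None):
    """Return (body, etag); body is None when the server answers 304."""
    req = conditional_request(url, stored_etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Unchanged since the last visit: nothing to re-parse.
            return None, stored_etag
        raise
```

Storing the ETag alongside each document would let an update run skip
retrieving and re-parsing anything the server reports as unchanged.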

I'm quite happy with the direction of 4-6. Each will require some work and
testing, but I think they'll improve the searching ability significantly.

-Geoff


