Subject: [htdig3-dev] Architecture Overview: htsearch parsing revisited
From: Geoff Hutchison (email@example.com)
Date: Wed Mar 29 2000 - 20:56:23 PST
Well, I've put this off for a while, especially as I try to wrap up
some of the loose ends for 3.2.0b2. But since I can't connect to the
PPP server (the line's busy), I have an excuse to write offline. :-)
Let's take a look at the code in htsearch. To be perfectly honest,
the division of labor isn't exactly fair. Most of it is done in three
files: htsearch.cc, parser.cc and Display.cc. Since the first isn't a
class, it should give up some of its code and maybe we can split the
others into a few additional classes. Please forgive me, the article
is going to be less of an overview of the current code and more of a
loose proposal of where htsearch should (?) go.
For the time being, we'll stick to the idea of one query per process.
To extend the current model of three methods (and/all, or/any,
boolean), I agree with a suggestion and propose a fourth: exact (i.e.
the whole query is treated as a phrase search). Now however the query
is formatted, we will need to do some transforming and parsing. I
think the current process of turning everything into a boolean
expression is the right way. So I propose instead of the "" operators
used now (which I consider a kudge that I wrote) we add a "near"
operator with a parameter defining how close the words must be to
Method: And, Query: For score and seven => ((For & score) & and) & seven
Method: Or, Query: "Geoff Hutchison" code => (Geoff near Hutchison) or code
Method: Exact, Query: search engine codebase =>
(search near engine) near codebase
Method: Boolean, Query: (Gilles near Geoff) and ht://Dig => [same]
I'll also suggest that the transforming will allow a unary not
operator that functions transforms as follows:
not Microsoft => * not Microsoft
So in our parsed, transformed expression, we'll have a few
"operators," namely (, ), &, |, !, *, and ~ (the last three being
NOT, ALL, and NEAR respectively).
Now each word token at a minimum also needs to keep track of its
"fuzzy factor," and if necessary the "field" to search. (We can treat
this simply as a mask for the flags.)
What I've just described is IMHO some requirements for a Parser
class--it transforms the query into an expression tree. Then a
Searcher class would walk the tree, iterating over a Collection to
return a ResultList. IMHO, the scoring, sorting and "winnowing"
(removing deleted documents, those not matching restrict or exclude
clauses, etc.) should be done inside this ResultList class, with some
of the work like initial scoring being done by the Searcher class.
This would leave the work of filling in the templates themselves to
the Display class, which seems fitting. That's not to say there isn't
a lot of work to be done for this, what with hilighting, finding
anchors, SGML and URL encoding, etc.
You'll note that I've been rather hazy on some important issues.
IMHO, it's very tricky to walk the expression tree optimally. After
all, you would rather not waste the memory on having the results of
all the searches at one time. However, if you do the searches
pairwise, you waste time against comparing the lists all at once. But
that's something we can talk about later.
To unsubscribe from the htdig3-dev mailing list, send a message to
You will receive a message to confirm this.
This archive was generated by hypermail 2b28 : Wed Mar 29 2000 - 20:05:12 PST