Subject: RE: [htdig3-dev] Re: ParseTree implementation
From: Geoff Hutchison (firstname.lastname@example.org)
Date: Tue Aug 22 2000 - 05:44:18 PDT
Thanks for your message--I've been hoping for this discussion for a
while, but it never seemed to come up. (Yet there are 100+ people on
this list, don't you have opinions?)
At 12:07 PM +0200 8/22/00, Quim Sanmarti wrote:
>If parser and resulting query are independent, different parsers can be
>written to generate queries composed by predefined query elements. This is
>customary in compiler/interpreter construction, where the target object
>code is defined independently from the parser (usually before :).
I actually never said the queries had to be done in the ParseTrees--I
simply wanted them as a "holding place" for results so that
intermediate results can be cached. The actual querying of the terms
is probably better done elsewhere. That code hasn't been written
because I'm still finding bugs in the ParseTree classes. :-(
That said, I'm always looking for feedback. Since the original
discussions in February, I've received little (if any) feedback. I
haven't had any since I actually started putting up code. So I'm glad
to see your message.
>2. I observe that a ParseTree query is implemented by a n-ary tree.
>Wouldn't it be clearer and simpler to use just unary and binary nodes?
>These are in fact the ones that will be used by query strings. Moreover,
>when evaluating a query, result sets will actually be filterered operator
>by operator in binary/unary fashion.
Not really. Remember an "and" query is really an n-ary operation.
Furthermore, it's often more efficient to walk several results at
once rather than pairwise. Think about walking an AND query of 4-5
queries with a table of each document. To combine all at once, you
just check to make sure it's in each document and add it to the
global result set. This is faster than log(n) combinations since you
don't re-traverse the intermediate values. But I think this is a
>I'd like to propose the following (not necessarily complete) class
Things like host: url: link: and (of course) field: are to be taken
care of in the query structure. My feeling was that this should go
into WeightWord, which would separate any sort of OP:query structure
and set field bits accordingly.
>operators And, Except, Near, Next, there's no need of evaluating the second
>operand if the first yields no results. This will be a major optimization.
>I suppose that this is already the idea and has been discussed elsewhere,
>but since the related code is not yet written when I write this, I felt i'd
>have to remark it.
Yes. The other major optimization is that each node in the ParseTree
can cache the results using an appropriate key (using the
initialQuery and type perhaps).
>To be able to implement these queries, I propose to modify the DocMatch
>class, in order to include a list (vector?) containing the match locations
>for the corresponding docId. All operators will have to deal with these
This would be fine--it's probably more efficient to do phrase
searching this way. Plus we could implement Google-style proximity
weighting if we desire (i.e. later).
>5. I see the application of Fuzzy filters to lookups as an open issue.
>Current fuzzy algos can be modeled as operators that result in a 'or' query
>of single words. Is this optimum? Wouldn't it be perferable to have 'leaf'
>queries (inherited from WordQuery) that do 'specialized' lookups according
>to each fuzzy algo, then merged with a single 'or'? Dunno.
My idea was that Fuzzy would be passed down the tree so that
individual types (i.e. phrase/Exact) can ignore all or some fuzzies.
When they hit the leafs, these would become OrParseTree objects and
add the appropriate alternates. I'm open to suggestions on that.
>6. (I'm not sure whether the following applies only to htsearch)
>Altavista disposes of a powerful hypenated-word query similar to the phrase
>query, so that:
> foo-bar is equivalent to "foo bar" or "foo-bar" or "foobar"
This could be a useful fuzzy algorithm. Right now they're not
phrase-aware. my original idea is that Fuzzy objects can return a
ParseTree of alternatives, so they can return phrases, normal
OrParseTrees, etc. This would help for the synonym fuzzy too.
>Is this feasible for our dear digger?
It's not the digger in question, is it? :-)
To unsubscribe from the htdig3-dev mailing list, send a message to
You will receive a message to confirm this.
This archive was generated by hypermail 2b28 : Tue Aug 22 2000 - 09:17:21 PDT