Subject: RE: [htdig3-dev] htsearch rewrite
From: Geoff Hutchison (email@example.com)
Date: Mon Sep 04 2000 - 12:49:29 PDT
At 12:43 PM +0200 9/4/00, Quim Sanmarti wrote:
>BTW, is there any project standard concerning comments, indentation,
>naming, etc that I must use?
Not really. Commenting often is something I've been pushing, esp.
comments at the beginning of procedures. But so far most people have
contributed fairly clear code. As you've probably worked out, we're
pretty open about it. It's not like we have tons of code to turn away
(vs. projects like gcc, apache, linux-kernel, etc.)
>Yessir. The question now is to define the cache policy and its parameters.
>Size? Expiration by age, LRU? What else?
Size should definitely be configurable (so that some can even turn it
off). Expiration by age might not be a bad first approach, but
eventually we may want to let this be configurable too. Certainly if
we make a flexible cache architecture, someone else could come in and
code their hearts out. :-)
(I think some larger sites may want to expire by something like
hits/age to keep common queries around.)
And, of course, everything should expire if the database modification
time changes! Perhaps we keep a special record for the previous
word_db modification time?
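
To make the policy above concrete, here is a minimal sketch in Python (not ht://Dig's C++ code). It assumes a cache keyed by a query signature string, with a configurable maximum size (0 turns it off), expiry by age, and a full flush whenever the word database modification time differs from the one last seen. The names QueryCache, max_entries, max_age_seconds and db_mtime are all hypothetical.

```python
import time
from collections import OrderedDict


class QueryCache:
    """Result cache with a configurable size and age-based expiry (sketch)."""

    def __init__(self, max_entries, max_age_seconds, clock=time.time):
        self.max_entries = max_entries          # 0 turns the cache off
        self.max_age_seconds = max_age_seconds
        self._clock = clock                     # injectable so expiry can be tested
        self._entries = OrderedDict()           # signature -> (stored_at, results)
        self._db_mtime = None                   # word database mtime last seen

    def _sync(self, db_mtime):
        # Drop everything if the word database changed since the last call.
        if db_mtime != self._db_mtime:
            self._entries.clear()
            self._db_mtime = db_mtime

    def get(self, signature, db_mtime):
        self._sync(db_mtime)
        entry = self._entries.get(signature)
        if entry is None:
            return None
        stored_at, results = entry
        if self._clock() - stored_at > self.max_age_seconds:
            # Too old: forget the entry and report a miss.
            del self._entries[signature]
            return None
        return results

    def put(self, signature, results, db_mtime):
        if self.max_entries <= 0:
            return                              # caching disabled
        self._sync(db_mtime)
        # Re-inserting moves the entry to the "newest" end of the dict.
        self._entries.pop(signature, None)
        while len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)   # evict the oldest entry
        self._entries[signature] = (self._clock(), results)
```

Because entries are always appended in the order they were stored, the first item is the oldest one, so size eviction and age expiry agree with each other. A hits/age policy would need a different ordering and is left out here.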
>Well, the 'Near' operator implementation is symmetric right now, so 'foo
>near bar' yields the same results as 'bar near foo'. Isn't this OK?
Yes, but I think this may be one of the few that really should be symmetric.
>My original issue is to find unique cache indexes anyway. I'm thinking that
>a possible solution is to implement OperatorQuery::Signature slightly
>different to OperatorQuery::GetLogicalWords, so that symmetric operands are
>lexicographically sorted. Thus,
Fair enough, though I think GetLogicalWords is still a good first
approximation. It's years better than what we have now. :-)
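
For reference, the sorted-operand signature idea from the quoted text could be sketched roughly as follows. This is Python for illustration only, not the OperatorQuery code: queries are modeled as nested tuples of the form (operator, operand, ...), and the set of symmetric operators is an assumption (only "near" is listed, since that is the one discussed here).

```python
# Hypothetical set of operators whose operand order does not matter.
SYMMETRIC_OPERATORS = {"near"}


def signature(query):
    """Build a cache key so that equivalent queries share one entry."""
    if isinstance(query, str):
        return query                      # a plain word is its own signature
    operator, *operands = query
    parts = [signature(operand) for operand in operands]
    if operator in SYMMETRIC_OPERATORS:
        parts.sort()                      # lexicographic order for symmetric operands
    return "(" + operator + " " + " ".join(parts) + ")"
```

With this, ("near", "foo", "bar") and ("near", "bar", "foo") map to the same key, while operators outside the set keep their operand order. Words containing spaces or parentheses would need escaping before this could be used as a real key.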
[on query optimizations]
>Never mind, this is a question of detail. I'll try to advance it iff I find
>some extra time.
Yeah, I'd like to push getting this integrated into htsearch first
and then hacking on all the STATUS items. :-)
>2.- An operator (title:<expression>) is more flexible. Phrases or full
>boolean expressions can be filtered by flags. This way, you might write
>title:"foo bar baz"
>title:(foo or bar)
>3.- The modifications to the parser(s) are simpler :)
Good. Both of these are the direction I was heading. Certainly #2 is
important since that's how I'd want to specify a title filter. I'll
start with just the simple parsers and test that out. Yes, your
modifications to DocMatch were exactly the direction I was thinking
for the flags, etc. I can't say I had much time to code though. :-(
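
As a rough illustration of filtering by flags, the sketch below (Python, not the actual DocMatch class) assumes each match carries a bitmask saying where the words were found. A title:(...) operator would then evaluate its inner expression as usual and keep only the matches with the title bit set. The Match class, the flag names and their values are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical flag bits; the real values may differ.
FLAG_TEXT = 1
FLAG_TITLE = 2


@dataclass(frozen=True)
class Match:
    doc_id: int
    flags: int      # bitwise OR of the places where the words were found


def field_filter(matches, required_flag):
    """Keep only the matches whose flags include required_flag."""
    return [m for m in matches if m.flags & required_flag]
```

Since the filter works on the result of any inner expression, it applies equally to a phrase such as title:"foo bar baz" and to a boolean expression such as title:(foo or bar).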
>Concerning '*', isn't this a particular case of the Regex fuzzy? Sorry if
>the question is naive, I'm not well acquainted with Regex. We're using just
Yes and no. It could be, but that's really slow. Since you know
*everything* is going to match, why not just grab DocURLs() or
DocIDs() up front? It's much faster than having to hit the word
database at all.
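
A small sketch of the difference, in Python and with made-up data structures: word_db here is a plain dict from word to the set of document IDs containing it, and all_doc_ids stands in for whatever DocIDs() would return. The regex path has to visit every word in the database, while the '*' shortcut never touches it.

```python
import re


def regex_search(pattern, word_db):
    """Slow path: test every indexed word against the pattern."""
    compiled = re.compile(pattern)
    result = set()
    for word, doc_ids in word_db.items():
        if compiled.fullmatch(word):
            result |= doc_ids
    return result


def search(term, word_db, all_doc_ids):
    """Answer '*' from the document list alone; otherwise use the word database."""
    if term == "*":
        return set(all_doc_ids)           # no word database access at all
    return regex_search(re.escape(term), word_db)
```

Note that the two paths are not strictly equivalent in this sketch: a document with no indexed words shows up in the '*' result but would not be found by scanning the word database.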