Re: htdig: Searching for adjacent words

Edmond Abrahamian
Mon, 30 Nov 1998 19:53:34 +0200 (EET)

On Tue, 24 Nov 1998, Geoff Hutchison wrote:

> The problem is this--right now we essentially don't store the location of
> words. But if we want to implement phrase or proximity searching, we need
> to store the location of *every* word in *every* document. Ouch. The word
> database would be huge.

I'd like to propose a "band-aid" fix in the interim. What if we begin by
doing an ordinary AND search, and then, for each hit, search the stored
part of the document (you must have max_head_length set to a reasonably
large value) for a regex matching the exact phrase, case-insensitively?
There are at least a few problems here:

   * performance may (will?) be an issue, because we may have to
     grep through a large number of documents. But hey, we grep through them
     now anyway! htsearch already scans the stored text in order to highlight
     those parts of the document that match the search words.

   * The part of the document that was missed by max_head_length could
     be the part that _would have_ generated a match. This is a potential
     problem for missed hits.

   * we must come to an agreement on what can constitute an "exact phrase".
     if we allow "and, or, with..." to be included in an exact phrase, we're
     going to end up grepping through every document in the database!
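
To make the idea concrete, here is a rough sketch of the post-filter step in
Python (not htsearch's actual C++ code). The names phrase_pattern and
phrase_filter are hypothetical, and it assumes the AND search has already
returned (doc_id, text) pairs. Slicing the text to max_head_length stands in
for the stored part of the document, which also shows how a phrase beyond
that limit gets missed.

```python
import re


def phrase_pattern(phrase):
    """Compile a case-insensitive regex matching the words of `phrase` in order."""
    words = phrase.split()
    if not words:
        raise ValueError("phrase must contain at least one word")
    # Allow any run of non-word characters (spaces, newlines, punctuation)
    # between consecutive words of the phrase.
    body = r"\W+".join(re.escape(w) for w in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def phrase_filter(hits, phrase, max_head_length):
    """Keep only the AND-search hits whose stored head contains the phrase.

    `hits` is an iterable of (doc_id, text) pairs (a hypothetical shape for
    the AND-search results). Only the first `max_head_length` characters of
    each text are checked, mimicking the stored part of the document.
    """
    pattern = phrase_pattern(phrase)
    kept = []
    for doc_id, text in hits:
        head = text[:max_head_length]  # anything past this point is never seen
        if pattern.search(head):
            kept.append(doc_id)
    return kept


hits = [
    ("a", "We support Phrase   searching in htdig"),
    ("b", "searching for a phrase"),
    ("c", "x" * 50 + " phrase searching"),
]
print(phrase_filter(hits, "phrase searching", 40))
print(phrase_filter(hits, "phrase searching", 100))
```

With a limit of 40, document "c" is dropped even though it contains the
phrase, because the match lies past the stored head. Raising the limit to 100
brings it back, at the cost of storing and scanning more text per document.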

All in all, for a "band-aid", I would be happy to use it this way, given that
it may be almost trivial to implement it (I think).

Any thoughts on this, anyone?


