Re: [htdig] Exact searches ... how?

Gilles Detillieux
Thu, 28 Oct 1999 10:05:19 -0500 (CDT)

According to Geoff Hutchison:
> >to true, but even so, it'll only index numbers of 3 or more digits
> >(minimum_word_length), and not single digits. Maybe someone who's
> >working on the phrase matching modifications can comment?
> Since I wrote the code, I guess that means me...
> I'm not certain how the 3:2 would be treated, but Gilles is right
> that individual numbers wouldn't be indexed, even with allow_numbers
> set to true.
> However, since '3:2' is the shortest example you'll have and this is
> three characters, this would probably work under phrase searching if
> you set the following:
> extra_word_characters: ":"
> allow_numbers: true

I thought of the extra_word_characters trick, but then that would treat
any colon as part of a word. E.g.: when indexing something like
"Do not be deceived: God will not be mocked", the word "deceived:" will
go into the database with the colon attached, so a search for "deceived"
will not find it. As I said, it's a sticky problem.
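
To make the side effect concrete, here is a minimal sketch in Python. It
is not ht://Dig's actual tokenizer: the tokenize() function and its regex
are hypothetical, and it only assumes that a word is a run of letters,
digits and any configured extra characters, with words shorter than the
minimum length dropped.

```python
import re

def tokenize(text, extra_word_characters="", minimum_word_length=3):
    """Hypothetical tokenizer for illustration; not ht://Dig's real code."""
    # A word is a run of letters and digits plus any configured extra characters.
    pattern = "[A-Za-z0-9" + re.escape(extra_word_characters) + "]+"
    words = [w.lower() for w in re.findall(pattern, text)]
    # Words shorter than the minimum length never reach the database.
    return [w for w in words if len(w) >= minimum_word_length]

verse = "Do not be deceived: God will not be mocked"

# Default settings: "deceived" is indexed without the colon.
print(tokenize(verse))

# With ":" as a word character, the token becomes "deceived:".
print(tokenize(verse, extra_word_characters=":"))

# The same setting is what lets "3:2" survive as a three-character word.
print(tokenize("Romans 3:2", extra_word_characters=":"))
```

Under these assumptions the setting that rescues "3:2" is the same one
that glues the colon onto "deceived", which is the trade-off described
above.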

That's why I was wondering how phrase searching was implemented. I can
imagine two different scenarios:

1) A phrase match requires an exact (albeit case-insensitive) match of
the entire phrase, including punctuation, but perhaps collapsing all
white space to a single space character. In this case, a search for
"Romans 3:2" would match:

        Romans 3:2
        ROMANS 3:2
but not
        romans 3 : 2

2) A phrase match only checks words in the database, and makes sure the
matching words appear in the correct sequence. In the case of "Romans 3:2"
only "romans" would be looked up in the database, the 3 and the 2 having
been thrown out because they're too short to bother with, so I assume that
any document with the word romans in it would be taken to match this
phrase.
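
The two scenarios can be sketched side by side in Python. This is only
an illustration of the distinction drawn above, not the phrase matching
code in ht://Dig. All function names are hypothetical, the minimum word
length of 3 is taken from the discussion, and colons are assumed not to
be word characters.

```python
import re

MINIMUM_WORD_LENGTH = 3  # assumed to mirror minimum_word_length

def exact_phrase_match(phrase, text):
    """Scenario 1: case-insensitive string match with white space collapsed."""
    def normalize(s):
        # Lowercase and squeeze every run of white space to one space.
        return " ".join(s.lower().split())
    return normalize(phrase) in normalize(text)

def indexed_words(text):
    """Words that would reach the database: lowercased, short ones dropped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= MINIMUM_WORD_LENGTH]

def word_sequence_match(phrase, text):
    """Scenario 2: the phrase's indexed words must appear consecutively."""
    wanted = indexed_words(phrase)
    found = indexed_words(text)
    if not wanted:
        return False
    return any(found[i:i + len(wanted)] == wanted
               for i in range(len(found) - len(wanted) + 1))

# Scenario 1 keeps the punctuation, so spacing around the colon matters.
print(exact_phrase_match("Romans 3:2", "ROMANS   3:2"))   # True
print(exact_phrase_match("Romans 3:2", "romans 3 : 2"))   # False

# Scenario 2 only sees "romans", so any document containing it matches.
print(word_sequence_match("Romans 3:2", "The Romans built roads"))  # True
```

In this sketch the first function needs the full document text to scan,
while the second needs only the word list, which is why the second
cannot tell "Romans 3:2" apart from any other mention of Romans.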

Is your implementation of phrase matching closer to the first or
second scenario, or is it something else altogether? If I recall,
what was discussed on htdig3-dev was rather like the 2nd case. The 1st
approach, which had been suggested way back, would require complete
document excerpts, not just the head, and doing a string match through
the excerpt to weed out false matches. I think this is what Wolfgang
would require, but not what's implemented. It's a tall order, but I
think there may be a need for varying levels of exactness in phrase
matching.

