Re: htdig: Searching Problems

Geoff Hutchison
Fri, 15 Jan 1999 16:09:40 -0500 (EST)

On Fri, 15 Jan 1999, Vishal Shah wrote:

> When I enter a phrase like "I want information on Plastics", the search
> does not return anything.

This would only happen if you have your method set to "all" or "and." Then
the search requires *every* word in the query to be present for a match to
be returned.
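
As a rough illustration of the difference (a toy sketch, not ht://Dig's
actual code; the DOCS index and the search() helper are hypothetical names
made up for this example), "and" matching fails as soon as one query word
is missing from a document, while "or" matching needs only one hit:

```python
# Hypothetical toy index: document name -> set of lowercased words it contains.
DOCS = {
    "plastics.html": {"plastics", "information", "polymer"},
    "about.html": {"information", "contact"},
}


def search(query, method="and"):
    """Return the sorted names of documents that match the query.

    method="and": every query word must be present in the document.
    method="or":  a single matching query word is enough.
    """
    words = query.lower().split()
    combine = all if method == "and" else any
    return sorted(
        name for name, text in DOCS.items()
        if combine(word in text for word in words)
    )


print(search("I want information on Plastics", method="and"))
print(search("I want information on Plastics", method="or"))
```

With "and", no document contains "I," "want" and "on," so nothing comes
back. With "or", both documents match on "information."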

> Is there a way to perform natural language searching with htdig, or is it
> restricted to entering only keywords?

Not really. I have a patch that considers word frequency in a query.
Ideally this would discount words like "want" in your above query and
focus more on "information" and "plastics." At the moment, it doesn't seem
to help much. Anyone interested in playing around with it should let me
know. It won't be in 3.1.0.
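
To make the idea concrete, here is a minimal sketch of frequency-based
weighting. It is not the patch itself: the GENERAL_FREQ table, its numbers,
and the word_weights() and score() helpers are all invented for
illustration. Each query word is weighted by how rare it is in general
usage, so common words contribute little to a document's score:

```python
import math

# Hypothetical general-usage frequencies, in occurrences per million words.
# These numbers are invented for illustration only.
GENERAL_FREQ = {
    "i": 10000.0,
    "on": 7000.0,
    "want": 900.0,
    "information": 300.0,
    "plastics": 5.0,
}


def word_weights(query, default_freq=1.0):
    """Weight each query word by its rarity: log(1e6 / frequency).

    Words missing from the table are assumed to be rare (default_freq).
    """
    return {
        word: math.log(1_000_000 / GENERAL_FREQ.get(word, default_freq))
        for word in query.lower().split()
    }


def score(doc_words, weights):
    """Sum the weight of every occurrence of a query word in the document."""
    return sum(weight * doc_words.count(word) for word, weight in weights.items())


weights = word_weights("I want information on plastics")
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:12s} {weight:.2f}")
```

In this sketch "plastics" gets the largest weight and "I" the smallest. A
document that repeats "information" often enough can still outscore one
that mentions "plastics" once, so weighting alone does not guarantee better
results.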

A full "natural language" parser is a big problem. I'm not convinced
AltaVista and similar engines do anything more than I've described--match
words in the query against their dictionary frequency and look for
documents with higher frequency than in general usage. After all, when I
enter your request to AltaVista, it tells me it searched on "I," "want,"
"information," "on," and "plastics."

> Also, if I enter "information on plastics", it gives me the documents
> matching 'information' rather than 'plastics'.

This would imply either
a) you have nothing that matches "plastics" exactly
b) the documents with "information" have higher weights than those with
   "plastics"

The first is a good possibility. The second might be improved by the patch
I mentioned. But it might not. Do documents with "information" have that
word more often than those with "plastics?"

> Is there a configuration which I have to change in the conf file or do I
> have to include other algorithms like soundex or metaphone in addition to
> exact, which is the default?

Soundex, metaphone and similar fuzzy algorithms will match "plastic" for
"plastics" or "information" for "informatoin" or similar. They won't give
you a natural language parser. They help most with misspellings.
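
For reference, a minimal sketch of the standard American Soundex coding
follows. This is an assumption about the general technique, not a copy of
ht://Dig's implementation, which may differ in details. It shows why such
an algorithm treats near-identical spellings as the same word and nothing
more:

```python
# Standard American Soundex letter groups; vowels, h, w and y get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ("" for no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if code and code != prev:
            result += code
        # h and w do not separate letters that share a digit; vowels do.
        if ch not in "hw":
            prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")


for word in ("plastic", "plastics", "information", "informatoin"):
    print(word, soundex(word))
```

Both "plastic" and "plastics" code to P423, and both spellings of
"information" code to I516, so each pair matches. Nothing here looks at
grammar or meaning.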

-Geoff Hutchison

