Re: [htdig] irrelevant pages in search


Subject: Re: [htdig] irrelevant pages in search
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Nov 29 1999 - 08:44:07 PST


According to David Mifsud:
> The database is a merge of about 5 DBs, and contains around 20K
> documents. The only relation about the documents is that they
> are hosted in the same region.
>
> I have rebuilt the database, (merged another DB), and checked
> the merging log, but did not find any errors.
>
> Now the search for "buskett" returns only a very small
> number of irrelevant pages. But when I searched for "david
> mifsud", 7 of the first 10 results were irrelevant (1-4, 6-8).
>
> The search algorithm I'm using is exact:1
>
> BTW, by irrelevant I mean, loading the page, doing a search
> for both the words david and mifsud, and not finding any of
> the words in the source!
>
> http://alpha.CompuCreations.com/search/

It's pretty hard to diagnose this, partly because the PHP wrapper obscures
things. Here are a few possibilities I can think of:

1) the PHP wrapper calls htsearch with options that allow some of these
irrelevant pages through, or calls htsearch with a different config file
than the one you think it's using. Try running htsearch directly from
the command line, with the -c option to explicitly specify the config
file you want, and see if the results are different.

2) you could have a corrupt database which is causing false positives.
If feasible, could you rebuild from scratch?

3) Is it possible that there are links to some of these irrelevant pages
that contain your name, or "buskett", in the link description text? htdig
indexes those words for the linked-to page as well, weighting them with
description_factor.

4) I looked at one of the false matches, and it didn't contain "david" or
"mifsud" anywhere in it. Do those names appear anywhere in any of the other
false matches, perhaps in meta tags?
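
To check possibilities 3 and 4 without reading page source by eye, something
like the following rough Python sketch might help. It reports whether each
search word shows up in a page's body text, in link text, or in a meta tag's
content attribute. Run it on a false match to look for meta tag hits, and on
pages that link to the false match to look for link text hits. The names
WordLocator and locate_words are made up for this example, and the matching
is a plain case-insensitive substring test, which is only an approximation of
how htdig splits text into words.

```python
from html.parser import HTMLParser


class WordLocator(HTMLParser):
    """Record where each search word occurs: body, link text, or meta tags."""

    def __init__(self, words):
        super().__init__()
        self.words = [w.lower() for w in words]
        self.hits = {w: set() for w in self.words}
        self._in_link = False

    def _record(self, text, where):
        # Substring match, so "david" also matches "davidson" (an assumption
        # made for simplicity, not htdig's own word-splitting rules).
        lowered = text.lower()
        for w in self.words:
            if w in lowered:
                self.hits[w].add(where)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
        elif tag == "meta":
            # Check the content attribute of any meta tag (keywords,
            # description, and so on).
            self._record(dict(attrs).get("content") or "", "meta")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        self._record(data, "link text" if self._in_link else "body")


def locate_words(html, words):
    """Return {word: set of places it was found} for one HTML document."""
    parser = WordLocator(words)
    parser.feed(html)
    parser.close()
    return parser.hits


if __name__ == "__main__":
    sample = (
        '<html><head><meta name="keywords" content="Mifsud, Malta"></head>'
        '<body><p>Welcome</p><a href="x.html">David\'s page</a></body></html>'
    )
    for word, places in locate_words(sample, ["david", "mifsud"]).items():
        print(word, sorted(places) or "not found")
```

An empty set for every word on the page itself, combined with "link text"
hits on the referring pages, would point toward possibility 3.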

> * From ghutchis@wso.williams.edu Sat Nov 27 21:58:28 1999
> * To: Dave <compu@csc.um.edu.mt>
> * Subject: Re: [htdig] irrelevant pages in search
> * Cc: htdig@htdig.org
> *
> * At 10:54 AM +0100 11/18/99, Dave wrote:
> * >Try it out at:
> * > http://alpha.CompuCreations.com/search/
> * >
> * >Words I have tried include "buskett" (results 2/3/6/10 are
> * >irrelevant, i.e. 40% from the 1st page!)
> *
> * I tried it out when you first sent the message and again now--I see
> * that a few of the results are irrelevant, but I'm not so sure all of
> * those you mention are irrelevant. At the least, I can see why they're
> * being flagged.
> *
> * You don't mention how many pages you have in your database or how
> * closely related they are. Offhand, I think some of your "irrelevant"
> * pages are scoring highly because they have a high backlink weight.
> * You might try lowering the backlink_factor
> * <http://www.htdig.org/attrs.html#backlink_factor>
> *
> * This factor weights "importance" of pages, essentially as the ratio
> * of the number of links pointing to a page to the number of links
> * on the page. (The ratio helps to remove "link farms" which
> * often have many links to them.)

Oh, is that what backlink_factor does? I never did understand it from
the description in attrs.html. This explanation is much clearer.
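
To make the ratio concrete, here is a small Python sketch of the calculation
as described in the quoted explanation: links pointing to a page divided by
links on the page. It only illustrates that description and is not taken from
the htdig source; the function name backlink_ratio, the sample link graph,
and the choice to treat a page with no outgoing links as having one (to avoid
dividing by zero) are all assumptions. How htdig combines this ratio with
backlink_factor in the final score is not shown here.

```python
def backlink_ratio(links):
    """links maps each page to the list of pages it links to.

    Returns {page: incoming_links / outgoing_links} for every page seen.
    """
    incoming = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            incoming[target] = incoming.get(target, 0) + 1

    ratios = {}
    for page, count in incoming.items():
        # Assumption: pages with no outgoing links are treated as having one.
        outgoing = max(len(links.get(page, [])), 1)
        ratios[page] = count / outgoing
    return ratios


if __name__ == "__main__":
    # Hypothetical site: "farm" is a page with many outgoing links.
    graph = {
        "home": ["about", "farm"],
        "about": ["home"],
        "farm": ["home", "about", "x1", "x2", "x3", "x4", "x5", "x6"],
    }
    for page, ratio in sorted(backlink_ratio(graph).items()):
        print(page, ratio)
```

In this made-up graph, "farm" has one incoming link and eight outgoing links,
so its ratio is low, while "about" has two incoming links and one outgoing
link, so its ratio is higher.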

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b25 : Mon Nov 29 1999 - 08:57:35 PST