Subject: Re: [htdig] irrelevant pages in search
From: David Mifsud (firstname.lastname@example.org)
Date: Sun Nov 28 1999 - 04:12:16 PST
The database is a merge of about 5 DBs and contains around 20K
documents. The only relation between the documents is that they
are hosted in the same region.
I have rebuilt the database (merging in another DB) and checked
the merging log, but did not find any errors.
Now the search for "buskett" includes only a very small
number of irrelevant pages. But when I searched for "david
mifsud", 7 of the first 10 results were irrelevant (1-4, 6-8).
The search algorithm I'm using is exact:1
BTW, by irrelevant I mean that I loaded the page, searched its
source for both words (david and mifsud), and found neither of
them!
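The manual check described above can be sketched in Python. This is only an illustration of the test I am applying by hand, not anything ht://Dig does internally; the function names and the whole-word, case-insensitive matching are my own assumptions.

```python
import re

def missing_words(page_source, query):
    """Return the query words that never occur as whole words in page_source."""
    text = page_source.lower()
    return [
        word for word in query.lower().split()
        if not re.search(r"\b" + re.escape(word) + r"\b", text)
    ]

def is_irrelevant(page_source, query):
    """True when none of the query words appear in the source (assumed definition)."""
    words = query.lower().split()
    # An empty query is never treated as "irrelevant".
    return bool(words) and len(missing_words(page_source, query)) == len(words)
```

A page counts as irrelevant here only when every query word is absent, which matches what I saw on results 1-4 and 6-8.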
* From email@example.com Sat Nov 27 21:58:28 1999
* To: Dave <firstname.lastname@example.org>
* Subject: Re: [htdig] irrelevant pages in search
* Cc: email@example.com
* At 10:54 AM +0100 11/18/99, Dave wrote:
* >Try it out at:
* > http://alpha.CompuCreations.com/search/
* >Words I have tried include "buskett" (results 2/3/6/10 are
* >irrelevant, i.e. 40% from the 1st page!)
* I tried it out when you first sent the message and again now--I see
* that a few of the results are irrelevant, but I'm not so sure all of
* those you mention are irrelevant. At the least, I can see why they're
* being flagged.
* You don't mention how many pages you have in your database or how
* closely related they are. Offhand, I think some of your "irrelevant"
* pages are scoring highly because they have a high backlink weight.
* You might try lowering the backlink_factor.
* This factor weights "importance" of pages, essentially as the
* number of links pointing to a page divided by the number
* of links on the page. (The ratio helps to remove "link farms" which
* often have many links to them.)
* Hope that helps,
* -Geoff Hutchison
* Williams Students Online
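As a rough illustration of the ratio Geoff describes, here is a minimal Python sketch. It is not ht://Dig's actual scoring code: the function names, the guard against pages with no outgoing links, and the simple multiplication by the factor are all assumptions made for the example.

```python
def backlink_ratio(links_to_page, links_on_page):
    """Links pointing to a page divided by links on the page."""
    # Assumed guard: treat a page with no outgoing links as having one,
    # so the division is always defined.
    return links_to_page / max(links_on_page, 1)

def backlink_weight(links_to_page, links_on_page, backlink_factor):
    """Hypothetical score contribution: the ratio scaled by the configured factor."""
    return backlink_factor * backlink_ratio(links_to_page, links_on_page)
```

Under these assumptions, a page with 50 incoming links and 200 outgoing links (a link-farm shape) gets a ratio of 0.25, while a page with 10 incoming and 5 outgoing links gets 2.0. Lowering the factor shrinks this contribution for every page in proportion, so word matches count for relatively more in the ranking.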