Peter Burden (email@example.com)
Wed, 10 Jun 1998 21:33:38 +0100
We've been running htdig on a medium site (some 18000 pages)
for some time and it's been quite OK (apart from the odd time the
database build filled the disc partition). Recent analysis of results
has identified one or two problems. Are these configuration issues?
Are there patches available?
1. Duplicate URLs
htdig doesn't seem too good at spotting multiple different
URLs pointing to the same page. Host name duplication
is handled but duplications such as
are not handled. They all point to the same page and
users are quite likely to quote any one of the three.
It gets worse when there are symbolic links between
directories on the server but this is a much harder
problem than that outlined above.
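To make the idea concrete, here is a minimal Python sketch of URL canonicalisation applied before a page is added to the index. It is not htdig code. The function name canonical_url, the DEFAULT_DOCS list and the example.org URLs are all assumptions made for illustration, and the rules shown (lower-casing the host, dropping an explicit port 80, folding a default index document into its directory) are only one plausible set.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of default documents; a real server's list may differ.
DEFAULT_DOCS = ("index.html", "index.htm")


def canonical_url(url):
    """Map equivalent spellings of an http URL onto one canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # An explicit default port adds nothing for http.
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    # An empty path means the server root.
    if not path:
        path = "/"
    # Treat ".../index.html" as the directory itself.
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_DOCS:
        path = head + "/"
    # The fragment is dropped: it never selects a different page.
    return urlunsplit((scheme, netloc, path, query, ""))
```

A directory URL quoted without its trailing slash cannot be folded by string rules alone, since only the server knows whether the name is a directory. The symbolic link case is out of reach of this approach altogether.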
2. AND (all) queries and "bad" words
In order to keep the database size under control, I've
told htdig not to index certain common words (stop words)
by incorporating them in the "bad words" file.
If I then do an "AND" query such as "School of Computing"
htdig reports no matching items since "of" was in the
stop word list. Surely stop words should be eliminated
from such queries before query processing.
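The behaviour I would expect can be sketched in a few lines of Python. This is an illustration only, not how htdig processes queries. The names and_query, index and bad_words are hypothetical, and the index is assumed to be a simple mapping from word to the set of document ids containing it.

```python
def and_query(query, index, bad_words):
    """Return ids of documents containing every non-stop word of the query.

    index     -- assumed layout: {word: set of document ids}
    bad_words -- set of stop words that were never indexed
    """
    # Drop stop words first, so they cannot force an empty result.
    terms = [w for w in query.lower().split() if w not in bad_words]
    if not terms:
        return set()
    # Intersect the posting sets of the remaining words.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

With "of" in bad_words, a query for "School of Computing" is then evaluated as "school AND computing" rather than failing outright.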
3. OR (any) query ranking.
It seems (I may be wrong) that the ranking of results
for a multi-word OR query is not influenced by the
fact that more than one of the words occurs in an item.
Again, this is not what people intuitively expect.
A query for "Wolverhampton Science Park" first listed
pages in which the word "Wolverhampton" was significant
apparently in an order related to the percentage of the
total document size occupied by this word irrespective of
whether the page contained the words "Science" or "Park".
[Even more puzzling, the top-ranking page contained
"Wolverhampton" only in a meta tag attribute value.]
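One way to get the intuitive ordering is to scale each document's score by the fraction of query words it actually contains. The Python sketch below shows that idea under assumed data structures. It is not htdig's ranking formula, and rank_or_query, doc_weights and the weights themselves are hypothetical.

```python
def rank_or_query(terms, doc_weights):
    """Rank documents for an OR query, favouring those matching more terms.

    doc_weights -- assumed layout: {doc_id: {word: weight}}
    Returns document ids, best match first.
    """
    ranked = []
    for doc_id, weights in doc_weights.items():
        matched = [t for t in terms if t in weights]
        if not matched:
            continue  # an OR query still needs at least one word
        base = sum(weights[t] for t in matched)
        # Fraction of the query words present in this document.
        coverage = len(matched) / len(terms)
        ranked.append((base * coverage, doc_id))
    # Highest score first; ties broken by document id for stable output.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ranked]
```

Under this scheme a page containing all of "Wolverhampton", "Science" and "Park" at modest weights outranks a page where "Wolverhampton" alone carries a high weight.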
--
From Peter Burden, firstname.lastname@example.org
Home Page http://www.scit.wlv.ac.uk/~jphb/