htdig: What am I doing wrong


Peter Burden (jphb@scit.wlv.ac.uk)
Wed, 10 Jun 1998 21:33:38 +0100


Hello,
        We've been running htdig on a medium site (some 18000 pages)
for some time and it's been quite OK (apart from the odd time the
database build broke the disc partition). Recent analysis of results
has identified one or two problems. Are these configuration issues?
Are there patches available?

1. Duplicate URLs

        htdig doesn't seem too good at spotting multiple different
        URLs pointing to the same page. Host name duplication
        is handled but duplications such as

        http://www.scit.wlv.ac.uk/university/scit
        http://www.scit.wlv.ac.uk/university/scit/
        http://www.scit.wlv.ac.uk/university/scit/index.html

        are not handled. They all point to the same page and
        users are quite likely to quote any one of the three.

        It gets worse when there are symbolic links between
        directories on the server but this is a much harder
        problem than that outlined above.
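
A minimal sketch, in Python, of the kind of normalisation that would
fold the three URLs above into one. This is not htdig code: the name
canonical_url, the assumption that index.html is the server's directory
index, and the rule that a final path segment without a dot is a
directory are all illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the server serves index.html as its directory index.
DIRECTORY_INDEX = "index.html"


def canonical_url(url):
    """Map equivalent spellings of a directory URL to one canonical form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last == DIRECTORY_INDEX:
        # .../scit/index.html -> .../scit/
        path = head + "/"
    elif last and "." not in last:
        # Heuristic (assumption): no extension means a directory.
        path = path + "/"
    # Host names are case-insensitive; drop any fragment.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


urls = [
    "http://www.scit.wlv.ac.uk/university/scit",
    "http://www.scit.wlv.ac.uk/university/scit/",
    "http://www.scit.wlv.ac.uk/university/scit/index.html",
]
canonical = {canonical_url(u) for u in urls}
```

With these assumptions the set canonical holds a single entry. The
"no dot means directory" rule can be wrong for real files without an
extension, and it does nothing for the symbolic link case.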

2. AND (all) queries and "bad" words

        In order to keep the database size under control, I've
        told htdig not to index certain common words (stop words)
        by incorporating them in the "bad words" file.

        If I then do an "AND" query such as "School of Computing"
        htdig reports no matching items since "of" was in the
        stop word list. Surely stop words should be eliminated
        from such queries before query processing.
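
The behaviour I would expect can be sketched in Python as below. It is
only an illustration, not htdig's query code: BAD_WORDS is a
hypothetical stand-in for the contents of the bad words file, and the
function names are made up for the example.

```python
# Hypothetical stop-word list; really this would come from the bad words file.
BAD_WORDS = {"of", "the", "and", "a"}


def and_query_terms(query, bad_words=BAD_WORDS):
    """Lower-case the query words and drop any that were never indexed."""
    terms = [w.lower() for w in query.split()]
    return [w for w in terms if w not in bad_words]


def matches_all(indexed_words, query):
    """AND match: every remaining query term must be in the document."""
    terms = and_query_terms(query)
    return bool(terms) and all(t in indexed_words for t in terms)


# Words indexed for one page; "of" is absent because it is a stop word.
page_words = {"school", "computing", "wolverhampton"}
found = matches_all(page_words, "School of Computing")
```

Here found is True, whereas requiring "of" to be present could never
succeed because the word is not in the database at all.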

3. OR (any) query ranking.

        It seems (I may be wrong) that the ranking of results
        for a multi-word OR query is not influenced by the
        fact that more than one of the words occurs in an item.
        Again, this is not what people intuitively expect.

        A query for "Wolverhampton Science Park" first listed
        pages in which the word "Wolverhampton" was significant,
        apparently in an order related to the percentage of the
        total document size occupied by this word, irrespective of
        whether the page contained the words "Science" or "Park".

        [Even more puzzling, the top-ranking page contained
        "Wolverhampton" only in a meta tag attribute value.]
                
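The ordering people seem to expect for an OR query could be sketched in
Python as follows: rank first by how many distinct query words a page
contains, and only then by word frequency. This is an illustrative
assumption about a sensible scheme, not a description of how htdig
scores pages; the page names and texts are invented.

```python
def or_rank(documents, query):
    """Rank pages for an OR query, favouring pages that match more words."""
    terms = {w.lower() for w in query.split()}
    scored = []
    for name, text in documents.items():
        words = text.lower().split()
        matched = terms.intersection(words)
        if not matched:
            continue  # an OR query needs at least one matching word
        # Share of the page taken up by the matched query words.
        frequency = sum(words.count(t) for t in matched) / len(words)
        scored.append((len(matched), frequency, name))
    # Most distinct words first, then highest frequency, then name.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [name for _, _, name in scored]


# Invented sample pages.
pages = {
    "a.html": "wolverhampton wolverhampton wolverhampton",
    "b.html": "the wolverhampton science park is home to many small companies",
    "c.html": "a science fair",
}
ranking = or_rank(pages, "Wolverhampton Science Park")
```

Under this scheme b.html comes first because it contains all three
words, even though "Wolverhampton" occupies a far larger share of
a.html.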

-- 
From Peter Burden, jphb@scit.wlv.ac.uk

Home Page http://www.scit.wlv.ac.uk/~jphb/


