[htdig] Todo Ideas. Spam control, new search options and output control.

Alberto Olindo (bindo@assibit.it)
Wed, 17 Feb 1999 18:04:35 +0100

Hi everybody,

I am quite new to htdig. I'm using it to build an insurance-related
search engine, which is growing quite well and which I hope to open at the
beginning of March.
Htdig is great: it solves problems and leaves loads of time to think about new
features :)
I was thinking about a few missing features (flame me if you must, but not
too hard):

-link:, url:, etc. search operators
        Is this the "Field-based searching" discussed in the TODO list?
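
To make the request concrete, here is a minimal sketch of how such
operators might be split out of a query string. It is written in Python
for illustration only and is not htdig code; the "field:value" syntax,
the KNOWN_FIELDS set and the parse_query name are all assumptions.

```python
# Hypothetical sketch: separate field-restricted terms from plain terms.
# The field names and the "field:value" syntax are assumptions, not
# existing htdig features.
KNOWN_FIELDS = {"link", "url", "title"}


def parse_query(query):
    """Return (plain_terms, field_terms) for a whitespace-separated query."""
    plain = []
    fields = {}
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and value and name.lower() in KNOWN_FIELDS:
            # e.g. "url:assibit.it" -> fields["url"] == ["assibit.it"]
            fields.setdefault(name.lower(), []).append(value)
        else:
            # Unknown prefixes (such as "http://...") stay ordinary terms.
            plain.append(token)
    return plain, fields
```

A query such as "insurance url:assibit.it" would then yield one plain
term and one url restriction, which the search stage could apply as a
filter on the stored document URL.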

-URL-dependent templates
        I'd like to use different templates for certain URLs: major
        sponsors, free services categories from our local directory,
        etc. Right now you can modify only the stars image ...
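
One way this could work is an ordered list of URL patterns, each mapped
to a template, with the first match winning. The sketch below is Python
for illustration only; the rule list, the template file names and
pick_template are hypothetical and not part of htdig.

```python
import fnmatch

# Hypothetical rules: (URL pattern, template file). First match wins.
# The hosts and file names are invented for the example.
TEMPLATE_RULES = [
    ("http://www.sponsor.example/*", "sponsor.html"),
    ("http://*/directory/*", "directory.html"),
]
DEFAULT_TEMPLATE = "default.html"


def pick_template(url, rules=TEMPLATE_RULES, default=DEFAULT_TEMPLATE):
    """Return the template for the first pattern that matches the URL."""
    for pattern, template in rules:
        # fnmatchcase does shell-style matching; "*" also matches "/".
        if fnmatch.fnmatchcase(url, pattern):
            return template
    return default
```

Keeping the rules ordered means a sponsor page that also lives under a
directory path still gets the sponsor template.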

-search output site grouping
        I'm getting loads of searches where the first 30 pages all come
        from the same site. Obviously this depends on my configuration.
        It would be nice to have a switch that groups all URLs from the same
        site, showing only the first hit and perhaps a variable like
        $(SISTER_URLS_LIST) that could be expanded to ... guess ...
        a list of linked URLs from the same site matching the query. :-)
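
The grouping described above can be sketched as follows, in Python for
illustration only. It assumes the results arrive as a list of URLs
already sorted by rank and that "same site" means "same host name";
group_by_site and the "sisters" key are hypothetical names.

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls):
    """Collapse a ranked URL list to one entry per host, keeping rank order."""
    groups = {}  # host -> entry; dicts preserve insertion order
    for url in ranked_urls:
        host = urlsplit(url).netloc.lower()
        if host not in groups:
            # The first (best-ranked) hit represents the site.
            groups[host] = {"first": url, "sisters": []}
        else:
            # Later hits would feed something like $(SISTER_URLS_LIST).
            groups[host]["sisters"].append(url)
    return list(groups.values())
```

Each returned entry carries the hit to display plus the remaining
matches from the same host, which a template variable could expand into
a list of links.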

-strong anti-spamming control
        The sites that most often show this behavior make heavy use of
        keywords, descriptions and lots of tricks to get high rankings.
        I'd like to give penalties for such things as keyword spamming,
        empty content, etc.
        Something like:
        max_keyword_frequency: 6
                if I get the same word more than 6 times ...
        max_keyword_density: 10%
                if I get more than 1 occurrence for every 10 words ...
        keyword_spam: -2
                I could start by giving a -2 penalty for extra words
        max_keyword_length: 150
                if the keyword tag is more than 150 characters long ...
                give a lower keyword factor.
        different_keyword_description: true
                if keyword and description are equal, discard one.
        Obviously discard duplicate documents (but that's there already).
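
A rough sketch of how these settings might combine, in Python for
illustration only. The numeric values come from the list above, but the
way they interact is an assumption: the allowed count per word is taken
as the stricter of the frequency and density limits, each extra
occurrence costs keyword_spam points, and the 0.5 scale for long tags is
invented because the proposal only says "a lower keyword factor".

```python
from collections import Counter

# Values from the proposal; none of these are existing htdig attributes.
MAX_KEYWORD_FREQUENCY = 6
MAX_KEYWORD_DENSITY_PERCENT = 10
KEYWORD_SPAM = -2
MAX_KEYWORD_LENGTH = 150
DIFFERENT_KEYWORD_DESCRIPTION = True
LONG_TAG_FACTOR_SCALE = 0.5  # assumption: "lower keyword factor" is unspecified


def tokenize(text):
    """Lowercase and split on commas and whitespace."""
    return text.lower().replace(",", " ").split()


def spam_penalty(keywords_tag, description_tag=""):
    """Score one document's keywords/description tags for spam signals."""
    words = tokenize(keywords_tag)
    # Density limit: at most 1 occurrence per 10 words, but never below 1.
    by_density = max(1, len(words) * MAX_KEYWORD_DENSITY_PERCENT // 100)
    allowed = min(MAX_KEYWORD_FREQUENCY, by_density)
    penalty = 0
    for count in Counter(words).values():
        penalty += KEYWORD_SPAM * max(0, count - allowed)
    too_long = len(keywords_tag) > MAX_KEYWORD_LENGTH
    scale = LONG_TAG_FACTOR_SCALE if too_long else 1.0
    # Identical keywords and description: keep only the keywords.
    duplicate = bool(words) and words == tokenize(description_tag)
    use_description = not (DIFFERENT_KEYWORD_DESCRIPTION and duplicate)
    return {
        "penalty": penalty,
        "keyword_factor_scale": scale,
        "use_description": use_description,
    }
```

With a 20-word keywords tag, for instance, the density limit allows two
occurrences of any word, so a word repeated eight times is charged for
six extra occurrences.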

-raw excerpts
        We are also using htdig to compile searchable databases of glossary
        data. If it were possible to have raw excerpts (we obviously have full
        documents in excerpts right now), we could dump the files and have a
        more compact and functional system.
        There is no real need to send the user to the HTML page after a
        search. But right now this means losing formatting and anchors.

Enough for now.

Alberto Olindo

Assibit S.r.l.

This archive was generated by hypermail 2.0b3 on Wed Feb 17 1999 - 10:10:03 PST