Re: [htdig] Todo Ideas. Spam control, new search options and output

Alberto Olindo
Fri, 19 Feb 1999 12:51:31 +0100

Geoff Hutchison wrote:
> First off, thanks for your suggestions!
> > -link: url: operators
> > is this "Field-base searching" discussed in the TODO list ?
> Not quite. I think I left "AltaVista style searches" on the TODO list.

The ToDo list states:
  AltaVista style +/- boolean queries
I thought it was ONLY +/- boolean queries :)

> > -url dependent template
> > I'd like to have different templates with certain urls, major
> > sponsors, free services categories from our local directory
> > etc. Right now you can modify only the stars image ...
> You can easily set up templates for each site and pick the template in the
> search form (using either a config file or the allow_in_form attribute set
> to template_name). You can basically set anything on the templates
> themselves.
> For examples, check out or my site's search at

Sorry, I didn't explain myself. This was not about template building.
It's about choosing templates and MIXING them in the same output.

We are also building a directory. We would like to give reviews of products
or services from each site in the directory etc. etc.

When it comes to the search engine, I'd like to modify the appearance of
particular URLs. If, say, the third URL returned by the search is in the
directory and has a good rating for products, I would configure it with
a line like this (as with star_patterns):
template_patterns: templates/good_products.html
                   Etc. etc.

The first two URLs would appear with the standard template, and the third
site in the list would look different.

I would also use this to give URLs from our site a totally different look
from other sites' pages.
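
The selection described above could be sketched roughly as follows. This
is Python for illustration only, not htdig code: it assumes that
template_patterns would be a list of (pattern, template) pairs checked in
order by plain substring match, with a fallback to the standard template.
The function name, the example patterns and templates/standard.html are
all made up.

```python
# Hypothetical sketch of the proposed template_patterns lookup.
# Nothing here is an existing htdig attribute or function.

DEFAULT_TEMPLATE = "templates/standard.html"  # assumed name


def pick_template(url, template_patterns, default=DEFAULT_TEMPLATE):
    """Return the template paired with the first pattern found in url."""
    for pattern, template in template_patterns:
        if pattern in url:  # assumption: simple substring match
            return template
    return default


# Example configuration: the patterns are invented for illustration.
patterns = [
    ("example.com/shop", "templates/good_products.html"),
    ("our-site.example", "templates/own_site.html"),
]
```

Each result would then be rendered with its own template, so a single
results page could mix the standard look with the special ones.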

> > -strong anti spamming control
> > The sites that happen to have more often this behavior are
> > intensively using keywords, description and lots of tricks to get
> > high rankings. I'd like to give penalties for such things as:
> > keyword spamming, empty content etc.
> If you're having a problem with this, I'd suggest setting something like
> this:
> keyword_factor: 0.5
> meta_description_factor: 0.5
> (i.e., basically ignore those two fields)

I'm already playing a lot with the factors. It's not easy to fine-tune
them, but it's good.
Keywords and description are very important to us. Good use of them can
be a real plus for good search results and service. Also, it's just a few
pages that are spamming that badly, perhaps 3 in 100. The problem is that
the other 97 get penalized too.
Having something to distinguish between them would be great.
What I proposed is quite verbose but gives more control. Information is
our asset; discarding it indiscriminately is a lost opportunity.

> > max_keyword_frequency: 6
> > if I get more than 6 times the same word...
> > max_keyword_density: 10%
> > if I get more than 1 occurrence for each 10 words....
> > keyword_spam: -2
> > I could start giving a -2 penalty for extra words
> > max_keyword_length: 150
> > if keyword tag is more than 150 characters long ...
> > spammed_keyword_factor:
> > give a lower keyword factor.
> > different_keyword_description: true
> > if keyword and description are equal discard one.
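
To make the proposal above a bit more concrete, here is a rough sketch of
how such checks might work. It is Python for illustration only and not
htdig code; the constants simply mirror the proposed attribute names, and
the way the keywords string is split into words is an assumption.

```python
from collections import Counter

# Hypothetical limits, named after the attributes proposed above.
MAX_KEYWORD_FREQUENCY = 6    # max repeats of the same word
MAX_KEYWORD_DENSITY = 0.10   # max share of all keywords one word may take
MAX_KEYWORD_LENGTH = 150     # max characters in the keywords tag
KEYWORD_SPAM = -2            # penalty for each extra occurrence


def split_keywords(keywords):
    """Split a META keywords string on commas and whitespace (assumed)."""
    return keywords.lower().replace(",", " ").split()


def keyword_penalty(keywords):
    """Sum KEYWORD_SPAM for every repeat beyond MAX_KEYWORD_FREQUENCY."""
    penalty = 0
    for count in Counter(split_keywords(keywords)).values():
        if count > MAX_KEYWORD_FREQUENCY:
            penalty += KEYWORD_SPAM * (count - MAX_KEYWORD_FREQUENCY)
    return penalty


def looks_spammed(keywords):
    """True if the tag is too long or one word is repeated too much."""
    if len(keywords) > MAX_KEYWORD_LENGTH:
        return True
    words = split_keywords(keywords)
    top = max(Counter(words).values(), default=0)
    if top > MAX_KEYWORD_FREQUENCY:
        return True
    # A word that appears only once is never counted as too dense.
    return top > 1 and top / len(words) > MAX_KEYWORD_DENSITY
```

A page flagged by looks_spammed could then get the lower
spammed_keyword_factor, while the other pages keep the normal
keyword_factor.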

> I'm also looking at a variety of search ranking improvements, including
> ranking words lower if they're more common. This would decrease the
> ranking of documents with frequent "spam" words.

Well, would more common words be ranked lower on a global basis or per
document? In the second case it's frequency analysis.
With long documents you would want to weigh frequency against document
length, which gives you density...
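
The two readings of the question above can be put side by side in a small
sketch (Python for illustration, not htdig code). The per-document
measure is plain density. For the global one, inverse document frequency
is used here only as one common choice; it is an assumption, not
necessarily what Geoff has in mind. Both function names are made up.

```python
import math


def density(word, doc_words):
    """Per document: occurrences of word divided by document length."""
    if not doc_words:
        return 0.0
    return doc_words.count(word) / len(doc_words)


def global_weight(word, docs):
    """Global: words found in many documents get a lower weight.

    Smoothed inverse document frequency, chosen for this sketch only.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log((1 + len(docs)) / (1 + containing))


# Tiny made-up collection: each document is a list of words.
docs = [["cheap", "hotel", "cheap"], ["cheap", "flights"], ["museum"]]
```

With density, a long page is not penalized just for being long; with the
global weight, a word that shows up everywhere counts for less in every
document.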

There is another good reason IMHO for this checking. Good sites with
good service will tend to comply with good policies. This means that
this checking should yield better searches.
I haven't looked at the htdig source yet (besides, my C++ knowledge is ~0),
so I don't know whether what I am proposing is totally outside the current
design.

Alberto Olindo
Assibit S.r.l.
