Scoring in Version 3.1.x

by S. Budd Copyright © 2000 S. Budd

Ranking pages and the use of Meta tags with ht://Dig

1. How pages are ranked.

The search program "htsearch" ranks the web pages which satisfy the search terms before they are returned in the results page. The ranking rule is complex, but it is driven by the following factors, each of which can be set either on the search form or in the site configuration file.

description_factor

Plain old "descriptions" are the text of a link pointing to a document. This factor gives weight to the words of these descriptions of the document. Not surprisingly, these can be pretty accurate summaries of a document's content.

default: 150
example: description_factor:  350

heading_factor

This is a factor which will be used to multiply the weight of words between <h1> and </h1> tags, as well as headings of levels <h2> through <h6>. It is used to assign the level of importance to headings. Setting a factor to 0 will cause words in these headings to be ignored. The number may be a floating point number.

default: 5
example:  heading_factor: 20.9

keywords_factor

This is a factor which will be used to multiply the weight of words in the list of keywords of a document. The number may be a floating point number.

default: 10
example:  keywords_factor: 12

meta_description_factor

This is a factor which will be used to multiply the weight of words in any META description tags in a document. The number may be a floating point number.

default: 50
example: meta_description_factor: 20

text_factor

This is a factor which will be used to multiply the weight of words that are not in any special part of a document. Setting a factor to 0 will cause normal words to be ignored. The number may be a floating point number.

default: 1
example: text_factor: 0

title_factor

This is a factor which will be used to multiply the weight of words in the title of a document. Setting a factor to 0 will cause words in the title to be ignored. The number may be a floating point number.

default: 100
example:  title_factor: 12
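
As a rough illustration of how the word factors above interact, the sketch below treats a page's score as the sum of the number of times a search word occurs in each part of the document, multiplied by the factor for that part. This is a simplification for intuition only and is not the actual rule used by htsearch; the function name sketch_score and the occurrence counts are hypothetical, while the factor values are the defaults listed above.

```python
# Hypothetical illustration only: the real htsearch ranking rule is more complex.
# The values are the defaults listed in this document.
DEFAULT_FACTORS = {
    "description": 150.0,
    "heading": 5.0,
    "keywords": 10.0,
    "meta_description": 50.0,
    "text": 1.0,
    "title": 100.0,
}


def sketch_score(occurrences, factors=DEFAULT_FACTORS):
    """Sum, over document parts, of (occurrences of the search word) * factor."""
    return sum(factors[part] * count for part, count in occurrences.items())


# One hit in the title outweighs many hits in ordinary text.
page_a = {"title": 1, "text": 3}
page_b = {"text": 20, "heading": 2}
print(sketch_score(page_a))  # 103.0
print(sketch_score(page_b))  # 30.0
```

Under these assumptions, setting a factor to 0 removes that part of the document from the score entirely, which matches the "will be ignored" behaviour described for each factor.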

backlink_factor

This is a weight of "how important" a page is, based on the number of URLs pointing to it. It's actually multiplied by the ratio of the incoming URLs (backlinks) and outgoing URLs, to balance out pages with lots of links to pages that link back to them. This factor can be changed without changing the database in any way. However, setting this value to something other than 0 incurs a slowdown on search results.

default: 1000
example:  backlink_factor: 501.1
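
The ratio described above can be sketched as follows. This is a minimal sketch, not the htsearch source: the function name sketch_backlink_bonus is hypothetical, and treating a page with no outgoing links as if it had one is an assumption made here to avoid dividing by zero.

```python
def sketch_backlink_bonus(backlinks, outgoing, backlink_factor=1000.0):
    """Return backlink_factor * (incoming links / outgoing links)."""
    if backlink_factor == 0:
        # A factor of 0 disables the bonus (and, per the text, the slowdown).
        return 0.0
    # Assumption: a page with no outgoing links is treated as having one.
    return backlink_factor * backlinks / max(outgoing, 1)


# A page with many outgoing links and few backlinks gets a small bonus.
print(sketch_backlink_bonus(backlinks=3, outgoing=12))   # 250.0
# A page that is linked to often but links out rarely gets a large one.
print(sketch_backlink_bonus(backlinks=12, outgoing=3))   # 4000.0
```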

date_factor

This factor, like backlink_factor, can be changed without modifying the database. It gives higher rankings to newer documents and lower rankings to older documents. Before setting this factor, it's advised to make sure your servers are returning accurate dates (check the dates returned in the long format). Additionally, setting this to a nonzero value incurs a performance hit on searching.

default: 0
example: date_factor: 0.35

2. Using <META .... > tags.

In HTML, any number of <META> tags can be used between the <HEAD> and </HEAD> tags of a document. There are three possible attributes to this tag, two of which are recognized by ht://Dig: One is NAME which is used to name a specific property and the other is CONTENT which is used to supply the value for a named property. For example, a document could start with something like the following:

  <HTML>
  <HEAD>
  <META NAME="htdig-keywords" CONTENT="phone telephone online electronic directory">
  <META NAME="htdig-email" CONTENT="pat.user@nowhere.net">
  <TITLE>Some document title</TITLE>
  </HEAD>
  <BODY> 

    Body of document 

  </BODY>
  </HTML>
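
To make the NAME/CONTENT pairing concrete, the following sketch pulls the pairs out of a document like the one above using Python's standard html.parser module. It only illustrates the tag structure and is not how htdig itself parses documents; the class name MetaCollector is hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name is not None and content is not None:
            self.meta.append((name.lower(), content))


SAMPLE = """<HTML><HEAD>
<META NAME="htdig-keywords" CONTENT="phone telephone online electronic directory">
<META NAME="htdig-email" CONTENT="pat.user@nowhere.net">
<TITLE>Some document title</TITLE>
</HEAD><BODY>Body of document</BODY></HTML>"""

collector = MetaCollector()
collector.feed(SAMPLE)
for name, content in collector.meta:
    print(name, "=>", content)
```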

htdig recognizes the following values for NAME:

NAME="htdig-keywords"

The value of this property should be a blank-separated list of keywords which will get a very high weight when searching. This can be used to get around some problems with common synonyms for words in the document. For example, if a document is a telephone directory, possible keywords could be "telephone phone directory book list". Now, regardless of what text is actually in the document, it can be found if these keywords are used in the search. The weight that words in the content string will have in a search can be modified using the keywords_factor attribute as outlined above.

NAME="keywords"

The value of this property should be a blank-separated list of keywords, just as for the htdig-keywords property. They are treated as equivalent by htdig. The reason for two different properties is that the keywords property is used by other search engines as well, while the htdig-keywords property can be used for words you want indexed only by htdig. You can get htdig to treat other property names as equivalent to htdig-keywords, or disable the htdig-keywords or keywords properties, by changing the keywords_meta_tag_names attribute in your configuration.

NAME="description"

The value allows you to specify an alternate excerpt (description) of a page. If the config-file attribute use_meta_description is set to true, then any documents with descriptions will use them instead of the automatically generated excerpts. The weight that words in the content string will have in search results is controlled by the meta_description_factor attribute in your configuration.

It is also possible to introduce arbitrary <META NAME="xxx"> tags. For example:

 <META NAME="dc.creator" CONTENT="Paul Wolstenholme">
 <META NAME="dc.creator" CONTENT="Richard Smith">

To do this you have to introduce the following two configuration entries:

keywords_meta_tag_names (needed when digging is done)

The words in this list are used to search for keywords in HTML META tags. This list can contain any number of strings that each will be seen as the name for whatever keyword convention is used. The META tags have the following format: <META NAME="somename" CONTENT="somevalue">

default: keywords  htdig-keywords
example:  keywords_meta_tag_names: keywords description dc.creator

In the above example you would use keywords_meta_tag_names: dc.creator
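
The effect of this attribute can be pictured as a filter over the META tags of a document: only tags whose NAME appears in the list contribute keywords. The sketch below assumes the (name, content) pairs have already been extracted and that names are compared without regard to case; the function name collect_keywords is hypothetical and the code is not taken from htdig.

```python
def collect_keywords(meta_pairs, keywords_meta_tag_names):
    """Return the words from META tags whose NAME is in the configured list."""
    # The attribute value is a blank-separated list of names.
    # Assumption: names are matched case-insensitively.
    wanted = {name.lower() for name in keywords_meta_tag_names.split()}
    words = []
    for name, content in meta_pairs:
        if name.lower() in wanted:
            words.extend(content.split())
    return words


pairs = [
    ("dc.creator", "Paul Wolstenholme"),
    ("dc.creator", "Richard Smith"),
    ("htdig-email", "pat.user@nowhere.net"),
]

# With the default list, dc.creator is not treated as a keyword source.
print(collect_keywords(pairs, "keywords htdig-keywords"))  # []
# Adding dc.creator to the list makes both creators searchable.
print(collect_keywords(pairs, "keywords htdig-keywords dc.creator"))
```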

max_meta_description_length (needed when digging is done)

While gathering descriptions from meta description tags, htdig will truncate descriptions which are longer than this length. This is required in case a webmaster tries to swamp a search result by repeating a keyword many times.

default: 512
example:  max_meta_description_length: 1000
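
The truncation amounts to keeping only the first max_meta_description_length characters of the description. A minimal sketch, assuming the length is counted in characters (htdig itself may count bytes) and using a hypothetical function name:

```python
def truncate_meta_description(description, max_meta_description_length=512):
    """Keep at most max_meta_description_length characters of a description."""
    return description[:max_meta_description_length]


# A description stuffed with one repeated keyword is cut back to the limit.
stuffed = "cheap " * 200
kept = truncate_meta_description(stuffed)
print(len(stuffed), len(kept))  # 1200 512
```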

It is possible to have the NAME="description" CONTENT=" xxx ..... " meta tag used for the description of a found page instead of the usual excerpts. This is accomplished with the following configuration parameter:

use_meta_description

If set to true, any META description tags will be used as excerpts by htsearch. Any documents that do not have META descriptions will retain their normal excerpts.

default: false
example: use_meta_description: true
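
The choice described above can be sketched as a simple fallback. The function name choose_excerpt and its arguments are hypothetical; the sketch only mirrors the behaviour stated in the text and is not the htsearch implementation.

```python
def choose_excerpt(meta_description, generated_excerpt, use_meta_description=False):
    """Pick the text shown under a search result."""
    # Documents without a META description keep their normal excerpt.
    if use_meta_description and meta_description:
        return meta_description
    return generated_excerpt


print(choose_excerpt("Staff phone directory", "... call 555 ...", True))
# Staff phone directory
print(choose_excerpt(None, "... call 555 ...", True))
# ... call 555 ...
```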


Last modified: $Date: 2001/01/22 01:21:58 $