Re: ht://Dig ...


Juergen Unger (j.unger@choin.net)
Fri, 19 Jun 1998 00:24:33 +0200


Hi!

> > I am trying to customize ht://Dig a bit for my usage and do have some
> > questions: Is it possible to tell ht://Dig that it should _not_
> > index text which is in between <a href...> and </a> tags ?
> No. To do that you will need to modify HTML.cc and set the weight of words
> there to 0.

I solved it in a slightly different way. Here is the diff:

------------------------------------------------------------------------------
*** HTML.cc.orig Thu Jun 18 09:25:46 1998
--- HTML.cc Fri Jun 19 00:38:53 1998
***************
*** 238,244 ****
            {
                word.lowercase();
                word.remove(valid_punctuation);
!               if (word.length() >= minimumWordLength)
                {
                    retriever.got_word(word,
                                       int(offset * 1000 / contents->length()),
--- 238,244 ----
            {
                word.lowercase();
                word.remove(valid_punctuation);
!               if ((word.length() >= minimumWordLength) && (in_ref == 0))
                {
                    retriever.got_word(word,
                                       int(offset * 1000 / contents->length()),
------------------------------------------------------------------------------

We found that it doesn't make sense to put the text of links into
the index as well. Someone who searches for a specific word normally
wants to find the pages that contain the information, not the pages
that merely link to the pages with the information ;-)
But maybe it would be best to extend this a bit so that one
can switch the indexing of link text on or off from the config file.
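
A minimal sketch of how such a switch could look, written as a
standalone function rather than as a patch to HTML.cc. The struct
IndexOptions, the function shouldIndexWord and the setting name
index_link_text are all hypothetical; none of them is an existing
ht://Dig attribute or function. Only in_ref and minimumWordLength
come from the diff above.

```cpp
#include <cstddef>

// Hypothetical settings holder. In ht://Dig the values would be read
// from the configuration file; the name indexLinkText (config attribute
// "index_link_text") is an assumption made for this sketch.
struct IndexOptions {
    std::size_t minimumWordLength;
    bool indexLinkText;  // false = skip words between <a href...> and </a>
};

// Returns true when a word should be handed to the indexer.
// inRef mirrors the in_ref flag from the diff: nonzero while the parser
// is inside an <a href...> ... </a> pair.
bool shouldIndexWord(std::size_t wordLength, int inRef,
                     const IndexOptions &opts)
{
    if (wordLength < opts.minimumWordLength)
        return false;
    // Link text is indexed only when the switch is on.
    return opts.indexLinkText || inRef == 0;
}
```

With the switch off this behaves like the patched condition in the
diff; with it on, it behaves like the original code.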

Another important question for me:

What would be the best way to change the code so that the excerpt
is taken from the 'description' meta tag, if it exists, instead of
from the text body? I need to implement this.
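
To make the intended behaviour concrete, here is a rough sketch of
the selection logic only. It does not show where ht://Dig stores or
builds its excerpts; the function chooseExcerpt and its parameters
are hypothetical names invented for this illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: prefer the content of the 'description' meta tag
// and fall back to the document body when no description was found.
// An empty metaDescription is treated as "tag not present".
std::string chooseExcerpt(const std::string &metaDescription,
                          const std::string &bodyText,
                          std::size_t maxLength)
{
    const std::string &source =
        metaDescription.empty() ? bodyText : metaDescription;
    // substr clamps the length, so short sources are returned whole.
    return source.substr(0, maxLength);
}
```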

Thanks in advance,
  -Juergen-Unger-

-- 
CHOIN! HCT GmbH -- http://www.choin.net


