Subject: Re: [htdig3-dev] Search engine for newsgroup ....
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Tue Jul 11 2000 - 10:36:36 PDT
According to Vikram Lele:
> I need further help on a small issue ...
> Does it also index contents inside <A HREF .....> tag ?
Not right within the tag itself. It only grabs the href parameter
and queues up the URL for indexing. However, it does index the link
description text between the <a ...> and </a> tags, and it does two
things with this text. First of all, it treats it as regular text
within the document being indexed, and indexes it with the weight
chosen by text_factor. It also adds it to the link description text
being stored for the referenced document (the one named in the href),
and indexes it with the weight chosen by description_factor.
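
To make the double bookkeeping above concrete, here is a rough Python model of it. This is only an illustrative sketch, not htdig's actual HTML.cc logic: the class name, the `queue` and `scores` structures, and the two numeric weights are all assumptions chosen for the example, standing in for whatever text_factor and description_factor are set to in your configuration.

```python
from collections import defaultdict
from html.parser import HTMLParser

TEXT_FACTOR = 1           # assumed value, stands in for text_factor
DESCRIPTION_FACTOR = 150  # assumed value, stands in for description_factor


class LinkTextIndexer(HTMLParser):
    """Toy model: link text is scored for the current page and for the target."""

    def __init__(self, doc_url):
        super().__init__()
        self.doc_url = doc_url
        self.current_href = None   # set while we are between <a ...> and </a>
        self.queue = []            # URLs found in href, to be indexed later
        self.scores = defaultdict(lambda: defaultdict(int))  # url -> word -> weight

    def handle_starttag(self, tag, attrs):
        # Only the href value is taken from the tag; other attributes are ignored.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current_href = href
                self.queue.append(href)

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None

    def handle_data(self, data):
        for word in data.lower().split():
            # 1) regular text of the document being indexed
            self.scores[self.doc_url][word] += TEXT_FACTOR
            # 2) link description of the referenced document
            if self.current_href:
                self.scores[self.current_href][word] += DESCRIPTION_FACTOR


def index_page(doc_url, html):
    parser = LinkTextIndexer(doc_url)
    parser.feed(html)
    parser.close()
    return parser
```

Feeding it a page that links to another document with the text "Archive" would credit "archive" to the linking page at the text weight and to the linked page at the description weight, while nothing inside the tag other than the href is looked at.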
> I am using HyperNews for our newsgroup. Each page has a few buttons
> for member list, subscription etc. Somehow some of these things (like
> the word search, members, subscription) seem to be getting indexed.
> Am attaching a portion of HTML page that was located by the htsearch
> when I looked for the word "search". The only portion of the page
> which had the word "search" is pasted here ...
> <A HREF="http://www.imgnews.com/search.html"
> onMouseOver="return window.help('Search for Messages');"
> onMouseOut="return window.help('');">
> <IMG SRC="http://www.imgnews.com/Icons/search.gif" BORDER=0
> WIDTH=60 HEIGHT=17
What it's grabbing from here is not the word "Search" from the onMouseOver
parameter, but rather from the ALT parameter in the IMG tag. This is
indexed as regular text, with the weight of text_factor, just as if you
had specified <A HREF...>Search</A>. In a text-only browser like lynx,
ALT text is completely equivalent to body text, so it makes sense for
htdig to treat it this way as well.
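
As a rough illustration of that behaviour (again a sketch, not the real parser), the Python below collects words from body text and from IMG ALT values, and ignores event-handler attributes such as onMouseOver. The class and function names are hypothetical, and the ALT="Search" value in the usage note is made up for the example, since the quoted snippet is cut off before that attribute.

```python
from html.parser import HTMLParser


class AltAsText(HTMLParser):
    """Toy model: ALT text of images is collected like ordinary body text."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.words.extend(alt.lower().split())
        # Nothing else inside a tag (onMouseOver, SRC, ...) contributes words.

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def indexed_words(html):
    parser = AltAsText()
    parser.feed(html)
    parser.close()
    return parser.words
```

With this model, an image button carrying a hypothetical ALT="Search" and a plain <A HREF...>Search</A> link both yield the same single word, which is the equivalence described above.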
> Is there a way we can avoid this getting indexed ? I tried using
> bad_words to filter this out .. but I guess that is not the correct
> place ..
No, you can't avoid indexing ALT text in IMG tags, nor indexing text
between <A> and </A> tags. Not without changing the htdig/HTML.cc code,
anyway. bad_words is for targeting individual words, regardless of
context, so that's not what you'd want.
If you can control what tags HyperNews puts out, perhaps you can configure
it to add noindex tags around the text you want to avoid.
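
A sketch of what that could look like, in Python for illustration. The marker strings below are an assumption on my part (comment-style markers of the form `<!--htdig_noindex-->` and `<!--/htdig_noindex-->`); check the documentation for your htdig version before relying on them. Both helper functions are hypothetical: one shows the wrapping HyperNews would need to emit, the other models an indexer dropping the marked region.

```python
import re

# Assumed marker strings; verify against your htdig version's documentation.
NOINDEX_START = "<!--htdig_noindex-->"
NOINDEX_END = "<!--/htdig_noindex-->"


def wrap_noindex(fragment):
    """Surround an HTML fragment with the (assumed) noindex markers."""
    return f"{NOINDEX_START}{fragment}{NOINDEX_END}"


def strip_noindex(html):
    """Model of an indexer skipping everything between the markers."""
    pattern = re.escape(NOINDEX_START) + r".*?" + re.escape(NOINDEX_END)
    return re.sub(pattern, "", html, flags=re.DOTALL)
```

Wrapping the navigation buttons this way keeps the message body indexable while the button text (including any ALT text) falls inside the skipped region.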
--
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930