Subject: Re: [htdig] Indexing URLs
From: Vincent Queru (email@example.com)
Date: Wed Sep 27 2000 - 00:17:43 PDT
Gilles Detillieux wrote:
> According to Vincent Queru:
> > Some time ago, I read that someone wanted to index not only the HTML
> > source but also the URLs that the robot comes across when indexing a
> > site.
> > I DO NOT want to index the URLs but unfortunately, they get indexed: is
> > there something I missed here?
> htdig doesn't make a point of indexing the URLs itself, but if any pages
> it indexes contain URLs as the link description text in a hypertext link,
> then that link's description text gets indexed. E.g.: in this link...
> <a href="http://www.htdig.org/files/">http://www.htdig.org/files/</a>
> the second occurrence of the URL will be treated as plain text, as
> well as a link description, and will be indexed. There's no easy,
> automatic way of avoiding this. Your best bet is to hunt down such
> files and change them. You could set description_factor to 0, and that
> will prevent the description from being indexed for the referenced page,
> but it will do this for all link descriptions, which may be overkill and
> undesired, plus htdig will still index the description as plain text for
> the page containing the reference, so you won't get rid of it entirely.
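The distinction drawn above, between the URL in the href attribute and the link description text between the tags, can be illustrated with a small sketch. This is not htdig's parser; it uses Python's standard html.parser module, and the names LinkCollector and collect_links are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, description text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, description) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside an open <a href=...> counts as description text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    """Return the list of (href, description) pairs found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


example = '<a href="http://www.htdig.org/files/">http://www.htdig.org/files/</a>'
for href, description in collect_links(example):
    print("href:       ", href)
    print("description:", description)
```

For the example link, the description is the same string as the href, which is the case where a URL ends up in the indexed text even though the href itself is not indexed as such.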
OK, I set description_factor to 0 and it works fine, because the site I index
is very special: it consists of one page full of links that all point to the same
page, with only the arguments changing (it is a dynamic PHP-coded site).
But I still have one more question: I had included a META NAME="robots"
VALUE="noindex" tag in the page containing the links, but they still got indexed. Is
that normal?
Furthermore, it is not the link description that got indexed but the link itself
(i.e. the URL contained in the A HREF tag).
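
One thing that may be worth checking: the conventional form of the robots tag carries its directives in a CONTENT attribute, not VALUE, and by that convention "noindex" applies to the page carrying the tag while "nofollow" is the directive about its links. The sketch below shows how a reader that follows this convention would see the two spellings. It is an illustration using Python's standard html.parser module, not htdig's own code, and whether htdig behaves the same way is an assumption; RobotsMetaReader and robots_directives are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsMetaReader(HTMLParser):
    """Gather directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() != "robots":
            return
        # Only the CONTENT attribute is consulted; VALUE is ignored.
        content = attrs.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html):
    """Return the set of robots directives declared in html."""
    reader = RobotsMetaReader()
    reader.feed(html)
    reader.close()
    return reader.directives


as_posted = '<META NAME="robots" VALUE="noindex">'
conventional = '<META NAME="robots" CONTENT="noindex,nofollow">'
print(sorted(robots_directives(as_posted)))
print(sorted(robots_directives(conventional)))
```

Under these assumptions the tag as posted yields no directives at all, while the CONTENT form yields both noindex and nofollow.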