Re: [htdig] Following links, not indexing a doc


Subject: Re: [htdig] Following links, not indexing a doc
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Nov 07 2000 - 11:23:15 PST


According to Eric Bliss:
> Htdig has been acting well for us for some time now, but there is one glitch that has been brought to my attention.
>
> We have a number of websites which are updated on a regular basis. Because of this, old pages are being unlinked every week from
> the main body of the site. To keep these pages in the search engine database (as opposed to being lost forever), I've created a
> page for each website that just consists of the URLs of each of these pages. At the top of these pages, I place the meta tags to
> tell htdig to follow the links, but not index the page <META NAME="ROBOTS" CONTENT="NOINDEX">. I use these pages as the base
> documents for htdig to crawl from.
>
> My problem is that although htdig's website says that it follows the robot rules, my index documents still show up when a search is
> done. Is there a different tag I should be using, or do you need to specify a setting in htdig for it to obey robot rules?

There's a subtle bug in 3.1.5 and earlier versions. The content parameter
of the meta robots tag should be case-insensitive, but htdig was expecting
lower-case, so CONTENT="NOINDEX" is not recognized. You can either change
the tag to use lower-case (CONTENT="noindex"), or apply this patch to fix
the code:

   ftp://ftp.ccsf.org/htdig-patches/3.1.5/robotsCaseI.0
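
To illustrate what case-insensitive handling of the content parameter
means, here is a minimal sketch in Python. It is not the htdig code and
not the patch above; the function names are made up for this example,
and it assumes the usual comma-separated directive values (noindex,
nofollow, none).

```python
def robots_directives(content):
    """Split a robots meta CONTENT value into a set of lower-cased directives.

    Hypothetical helper for illustration only; not part of htdig.
    """
    return {token.strip().lower() for token in content.split(",") if token.strip()}


def may_index(content):
    """Return False if the directives forbid indexing the page."""
    # "none" is shorthand for "noindex, nofollow".
    return not ({"noindex", "none"} & robots_directives(content))


def may_follow(content):
    """Return False if the directives forbid following the page's links."""
    return not ({"nofollow", "none"} & robots_directives(content))


if __name__ == "__main__":
    # Upper-case and lower-case values are treated the same way.
    for value in ("NOINDEX", "noindex", "Index, Follow", "NONE"):
        print(value, may_index(value), may_follow(value))
```

With this kind of comparison, CONTENT="NOINDEX" and CONTENT="noindex"
both keep the page out of the index while its links are still followed.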

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



