[htdig] limiting indexing to certain cgi pages only

Subject: [htdig] limiting indexing to certain cgi pages only
From: Jerry Preeper (preeper@cts.com)
Date: Sat Nov 06 1999 - 03:21:53 PST

I have htdig up and running fine and it indexes a database of about 8000
pages. Now I'm trying to set up a second database that will include only
the output from a specified cgi program that retrieves news stories from a
database. It's not included in the main dig because I exclude all cgi
output in that one.

To start with I created a script that creates a single page with links to
all the stories from the database using the cgi script to display the
stories. I set up a separate rundig script, conf file and db directory
just for this so I can merge them later. In the conf file I have the
following:
local_urls: http://www.foo.com/=/www/foo/htdocs/
# the page with all the links on it
start_url: http://www.foo.com/links.html
limit_urls_to: http://www.foo.com/cgi-bin/
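
To make it easier to reason about which URLs a configuration like this would let through, here is a minimal Python sketch of substring-based URL filtering. It is only a model, not htdig's actual code: it assumes limit_urls_to accepts a URL when any pattern occurs in it, and that a separate exclude_urls attribute rejects a URL when any of its patterns occurs in it. The exclude list "/cgi-bin/ .cgi" is what I understand the default to be when exclude_urls is not set, which should be checked against the documentation for the installed version. The story URL and the function name are made up for illustration.

```python
# Hypothetical model of substring-based URL filtering; NOT htdig's real code.

def url_is_accepted(url, limit_patterns, exclude_patterns):
    """Accept a URL that contains at least one limit pattern
    and none of the exclude patterns."""
    within_limits = any(p in url for p in limit_patterns)
    excluded = any(p in url for p in exclude_patterns)
    return within_limits and not excluded

# Taken from the conf file above.
limit_urls_to = ["http://www.foo.com/cgi-bin/"]
start_url = "http://www.foo.com/links.html"

# Assumption: the default exclude_urls value when the conf file does not set it.
assumed_default_excludes = ["/cgi-bin/", ".cgi"]

# Hypothetical link of the kind links.html would contain.
story_url = "http://www.foo.com/cgi-bin/news.cgi?id=1"

for excludes in ([], assumed_default_excludes):
    for url in (start_url, story_url):
        verdict = url_is_accepted(url, limit_urls_to, excludes)
        print(f"excludes={excludes!r} url={url} accepted={verdict}")
```

Under these assumptions, the cgi links only pass the filter when the exclude list does not contain "/cgi-bin/" or ".cgi", so an explicitly set exclude_urls in the second conf file would be one thing to try.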

Whenever I run htdig though, I only get the following:
/rundig2 -vvv
htdig: Run complete
htdig: 1 server seen:
htdig: www.foo.com:80 2 documents
htmerge: Total word count: 326
htmerge: Total documents: 1
htmerge: Total doc db size (in K): 9

It doesn't follow any of the links on the page, which would be something like

I'm running a tail on the access log when I run htdig and I see it asking
for the robots.txt file, but nothing else, which makes some sense since I'm
running htdig through the filesystem instead of http requests. My
robots.txt file has the following in it:
User-agent: *
Disallow: /adgifs/
Disallow: /adhtml/
Disallow: /gifs/
Disallow: /icons/
Disallow: /images/
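
As a sanity check that these rules are not what keeps the cgi links out, the following sketch feeds the same robots.txt lines to Python's standard urllib.robotparser. The agent name "htdig" and the two test URLs are assumptions for illustration only; the real story URLs and the agent string htdig sends may differ.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted above.
robots_lines = [
    "User-agent: *",
    "Disallow: /adgifs/",
    "Disallow: /adhtml/",
    "Disallow: /gifs/",
    "Disallow: /icons/",
    "Disallow: /images/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Hypothetical URLs: one cgi story link and one image under a disallowed path.
cgi_url = "http://www.foo.com/cgi-bin/news.cgi?id=1"
image_url = "http://www.foo.com/images/logo.gif"

cgi_allowed = parser.can_fetch("htdig", cgi_url)
image_allowed = parser.can_fetch("htdig", image_url)

print("cgi link allowed:", cgi_allowed)
print("image allowed:", image_allowed)
```

None of the Disallow lines cover /cgi-bin/, so by this check the robots.txt file should not be the reason the links are skipped.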

For the life of me I can't figure out what I'm missing here. I don't have
any meta tags in the document that contains all the links. Also, I'd
really like to exclude the links.html page from the database, which I'm
assuming I can do by putting in a meta tag for noindex, but I'd like to
get it indexing everything first, then I can deal with that.

Any input would be greatly appreciated. Also, please reply directly as I'm
not subscribed to the list at this time.



This archive was generated by hypermail 2b25 : Sat Nov 06 1999 - 03:33:18 PST