Subject: [htdig] Looking for start_url strategies
From: David Gewirtz (david@ZATZ.com)
Date: Thu Nov 30 2000 - 17:16:43 PST
I've just started tinkering with htdig (and really like it). But now I'm
trying to figure out a strategy for deciding on what sites to index. We'd
like to index most of the sites in our area of interest (say fifty or sixty
sites). The problem is, some site operators are well prepared, with
appropriate robots.txt files and META tags. Others have no search engine
directions at all.
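For reference, the kind of crawler direction a well-prepared site gives looks something like this (the paths are made-up examples, not from any real site):

```
# robots.txt at the site root -- steers crawlers away from messy areas
User-agent: *
Disallow: /cgi-bin/
Disallow: /cart/

# ...or a per-page META tag in the HTML <HEAD>:
# <META NAME="robots" CONTENT="noindex,nofollow">
```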
So, for example, if we index a site that has some good information mixed with a
ton of messy e-commerce product URLs, then when a user runs a search, the
results come back very cluttered. I'd like to avoid that by only indexing
sites that seem likely to produce clean results. Unfortunately, I can't know
that ahead of time.
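One partial workaround that may help with the messy-site case: htdig's exclude_urls attribute can filter out URLs matching certain patterns without dropping the whole site from start_url. A sketch of what the relevant line in htdig.conf might look like (the patterns beyond the defaults are hypothetical examples):

```
# htdig.conf fragment -- skip any URL containing these substrings.
# /cgi-bin/ and .cgi are the defaults; the rest are guesses at
# what an e-commerce site's product URLs might contain.
exclude_urls:   /cgi-bin/ .cgi ?add-to-cart= /catalog/sku
```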
One thought was to index one site at a time and check each out. But that'll
take forever. Another thought was to index all the sites and, if one turns out
to be messy, remove it from the start_url set, do an htdig -i, and clean out
the database. But that would require taking the database down for the duration
of the re-index, and once the server goes live, that's not really acceptable.
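On the downtime point: htdig's -a flag ("use alternate work files") is meant for exactly this, assuming your version supports it. The dig and merge write to .work copies of the databases while htsearch keeps serving the live files, and you swap them in afterward. A rough sketch (the database directory path is illustrative, not a default you should rely on):

```shell
# Re-index into .work copies; htsearch keeps using the live databases
htdig -i -a
htmerge -a

# Then swap the .work files into place (hypothetical database_dir)
cd /var/lib/htdig/db
for f in *.work; do
    mv "$f" "${f%.work}"
done
```

The rename step is just stripping the .work suffix, so the swap is quick compared to the dig itself.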
So, first question: is it possible to REMOVE a site and its associated
URLs from a database without reinitializing?
Second question: what practices are other people using to help make sure
their search results content is reasonably clean?
Thanks in advance,
List archives: <http://www.htdig.org/mail/menu.html>
This archive was generated by hypermail 2b28 : Thu Nov 30 2000 - 17:24:54 PST