Geoff Hutchison (email@example.com)
Tue, 27 Apr 1999 20:37:14 -0400
At 8:08 PM -0400 4/27/99, Tim Perdue, Geocrawler.com wrote:
>I have over 1.6 million pages on my site, and ht://dig wants to reindex
>*all* of them every time it digs.
On an update dig, it does check every page. However, it *doesn't* reindex
them. There are several checks along the way to ensure it does the least
amount of work on an update dig.
You still want to know why you have to check all of your pages, right? Your
situation is a little different. You *know* which pages have changed or are
new. On most websites, there's no way of knowing which pages have changed
since the last run or where there might be new URLs. So you have to at
least check if they've changed. Since checking is pretty fast (especially
compared to indexing), it's usually not a big deal.
Our defaults are designed for normal usage. Your mileage may vary. :-)
>I tried setting up a page that only includes *new* links for it to dig, but
>it goes ahead and digs all the old links in its database as well.
If you can set up a page with the new links, you can set up a config for
that page, index it, and merge the resulting databases into the larger db.
Something like this (new.conf would index new URLs):
htdig -c new.conf # No need to do a -a because we don't use these for
htmerge -m new.conf -a -c 1.conf # Don't have to run htmerge on new.conf...
[move updated regular databases into place]
[remove databases from new.conf so they don't mess up future digs]
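The steps above could be wrapped in a small script, roughly like the sketch below. The function names make_links_page and dig_and_merge are made up for illustration, and it assumes you can produce a plain list of new URLs, one per line. It uses only the htdig and htmerge flags already shown above; the two bracketed steps are left as manual work because the database file names depend on your configs.

```shell
#!/bin/sh
# Sketch only: make_links_page and dig_and_merge are hypothetical helper
# names, not part of ht://Dig.

# Read one URL per line on stdin and print a bare HTML page linking to each,
# for use as the start page of a small "new pages only" dig.
# Assumes the URLs need no HTML escaping.
make_links_page() {
    printf '<html><head><title>New pages</title></head><body>\n'
    while IFS= read -r url || [ -n "$url" ]; do
        # Skip blank lines so they do not become empty links.
        [ -n "$url" ] || continue
        printf '<a href="%s">%s</a><br>\n' "$url" "$url"
    done
    printf '</body></html>\n'
}

# Dig only the new links, then merge the result into the main databases.
# $1 = config for the new-links dig, $2 = config for the main databases.
dig_and_merge() {
    new_conf=$1
    main_conf=$2
    htdig -c "$new_conf" || return 1
    # -m names the config whose databases get merged in; -a makes htmerge
    # build work copies rather than touching the live files directly.
    htmerge -m "$new_conf" -a -c "$main_conf" || return 1
    # Still manual, as in the steps above: move the updated main databases
    # into place, then remove the databases that belong to new_conf.
}
```

Usage might look like `make_links_page < new-urls.txt > new.html` followed by `dig_and_merge new.conf 1.conf`, where new-urls.txt is a hypothetical file of new URLs and new.conf starts its dig at wherever new.html is served.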
>Why won't it just dig the new links and add those pages to the database?
>It's totally impractical to have it reindex the entire web site everyday (in
>fact, it takes 4 days for each dig).
If your updates are taking the same length of time as your original
indexing, you're actually doing the equivalent of a dig with -i. But for
your situation, it's probably *much* easier to dig and merge than to bother
Williams Students Online
This archive was generated by hypermail 2.0b3 on Tue Apr 27 1999 - 17:47:02 PDT