[htdig] subsequent digs


Tim Perdue, Geocrawler.com (tim@geocrawler.com)
Tue, 27 Apr 1999 19:08:17 -0500


I have over 1.6 million pages on my site, and ht://dig wants to reindex
*all* of them every time it digs.

I tried setting up a page that only includes *new* links for it to dig, but
it goes ahead and digs all the old links in its database as well.

I am *not* using the -i option.

Why won't it just dig the new links and add those pages to the database?
It's totally impractical to have it reindex the entire web site every day (in
fact, each dig takes 4 days).
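
For what it's worth, here is the difference between the two modes as I
understand it (a sketch, reusing the paths from my dig command below):

# initial dig: -i discards the existing databases and starts from scratch
/atlas18gb/htdig/bin/htdig -i -c /atlas18gb/htdig/conf/1.conf -s

# update dig (what I run now): without -i, htdig still re-walks every
# URL already in its database to look for changes and new links
/atlas18gb/htdig/bin/htdig -c /atlas18gb/htdig/conf/1.conf -s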

Dig command:

/atlas18gb/htdig/bin/htdig -c /atlas18gb/htdig/conf/1.conf -s >> /atlas18gb/htdig/1.db/dig.log
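
(In htdig 3.x the dig is normally followed by htmerge to rebuild the word
index before searching; a sketch, assuming htmerge sits in the same bin
directory and reads the same config file:)

/atlas18gb/htdig/bin/htmerge -c /atlas18gb/htdig/conf/1.conf -s >> /atlas18gb/htdig/1.db/dig.log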

This is my 1.conf, excluding the .gif stuff:

----start----

database_dir: /atlas18gb/htdig/1.db
start_url: http://db.geocrawler.com/archives/3/1/
limit_urls_to: http://db.geocrawler.com/archives/3/1/
backlink_factor: 0
sort: score

limit_urls_to: <<--- OK I'll fix this (fix shown after the config).
exclude_urls: /cgi-bin/ .cgi
maintainer: tim@geocrawler.com
max_head_length: 10000
#server_wait_time: 1
max_doc_size: 1500000
search_algorithm: exact:1 synonyms:0.5 endings:0.1

----end----
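
(The fix for the note above is simply to drop the empty duplicate, since
limit_urls_to is already set near the top; the surviving line would be:)

limit_urls_to: http://db.geocrawler.com/archives/3/1/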

Thanks! ht://dig is working really well, if I can just get rid of these last
few glitches!

Tim Perdue
PHPBuilder.com / GotoCity.com / Geocrawler.com
