Re: [htdig] modification_time_is_now again and the 'unwanted 9999 dig' bug


From: Giancarlo Pinerolo (ping@alter.it)
Date: Wed Dec 01 1999 - 03:09:10 PST


I wrote:

> I don't understand why it says 'it will cut down on reindexing from such
> servers when doing updates'.
>
> EG
> 1) a doc has 'last modified' unknown (which, as I recall from a previous
> post, means actually 0)
> 2) this, on the first run, gets changed to now (let's say 30/11/1999
> 00.00)
> 3) on the next run the same doc will return 0 again
> 4) then what happens? Will it
>
> a) compare 0 to 30/11 and decide that it has not been changed?
>
> or
>
> b) transform 0 to now again (let's say 01/12/1999) and reindex it?
>
> From that phrase in the doc I guess the first, is that right?

You wrote:

> b. The only way I can see it not being reindexed is if the server
> accepts the Last-Modified header and doesn't send the document back.
> Caveat: This is actually what happens in a specific case and is the
> reason the option is in there.

Someone pointed out that pages that do not return a mod_t are mostly
dynamic ones.
So it seems logical that assigning them a mod_t = now will force a reindex
anyway, but that phrase ('cutting down on reindexing') caused some confusion...
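
To make the two readings (a) and (b) concrete, here is a minimal Python
sketch of the decision. It is not htdig's actual code: the function name
needs_reindex, its parameters, and the convention that 0 means 'unknown'
are assumptions taken from the description above, and times are plain
Unix seconds.

```python
def needs_reindex(server_mod_t, stored_mod_t, now, substitute_now):
    """Return (reindex, mod_t_to_store) for one document.

    Hypothetical model, not htdig source. A server_mod_t of 0 means the
    server sent no Last-Modified date.
    """
    if server_mod_t == 0 and substitute_now:
        # Reading (b): the unknown date is replaced by the current time,
        # so it is always newer than the date stored on the previous run.
        server_mod_t = now
    # Refetch only when the document looks newer than the stored copy.
    reindex = server_mod_t > stored_mod_t
    return reindex, max(server_mod_t, stored_mod_t)


first_run = 943920000            # 30/11/1999 00:00 UTC
second_run = first_run + 86400   # 01/12/1999 00:00 UTC

# Reading (a): the raw 0 is compared with the stored date.
print(needs_reindex(0, first_run, second_run, substitute_now=False))
# Reading (b): 0 becomes "now" before the comparison.
print(needs_reindex(0, first_run, second_run, substitute_now=True))
```

Under these assumptions, reading (a) never refetches a date-less page
once it has been stored, while reading (b) refetches it on every run.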

> If you're indexing from a cache
> (specifically WWWOFFLE), it will see that the date you sent matches
> the date it has in cache and not bother to d/l or send it on to htdig.
>

No. It's real-world indexing.
All $start_url entries are individually selected pages (max_hop_count: 0).
No digging at all is wanted. Nevertheless, I assure you that a 9999 dig starts.
(Anyway, wwwoffle seems to preserve the original doc's mod_t.)
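
For reference, the cache case described in the quote can be modelled as
an ordinary HTTP conditional request. The sketch below is a generic
illustration in Python, not WWWOFFLE's or htdig's code; the function
name cache_response is hypothetical, and both dates are assumed to be
HTTP date strings.

```python
from email.utils import formatdate, parsedate_to_datetime


def cache_response(if_modified_since, cached_last_modified):
    """Return the HTTP status a simple cache might send (hypothetical)."""
    if if_modified_since is None or cached_last_modified is None:
        # Without both dates there is nothing to compare: send the page.
        return 200
    requested = parsedate_to_datetime(if_modified_since)
    cached = parsedate_to_datetime(cached_last_modified)
    # 304 Not Modified: the cached copy is no newer than the date the
    # indexer sent, so the document body is not transferred again.
    return 304 if cached <= requested else 200


stored = formatdate(943920000, usegmt=True)   # date kept from the first dig
print(cache_response(stored, stored))         # dates match
print(cache_response(None, stored))           # no date sent by the indexer
```

When the dates match, the document is not sent back, which is the one
case where a date-less page would not be reindexed.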

I think the 'unwanted 9999 dig' bug is a real one, and I just made a test
to prove it; you can try it too:

1) initial dig:

---------htdig.conf
common_dir: /home/htdig/common
database_dir: /home/htdig/db/test
start_url: http://www.yahoo.com/
limit_urls_to: $start_url
max_hop_count: 0
create_url_list: yes
modification_time_is_now: true
date_factor: 100

------ commands to execute
/usr/sbin/htdig -v -s -t -i -l -h0 -c htdig.conf>log
/usr/sbin/htmerge -vv -s -c htdig.conf>>log

-This will correctly index only the start page of yahoo

2) update dig

--- htdig-u.conf
common_dir: /home/htdig/common
database_dir: /home/htdig/db/test
start_url: http://www.yahoo.com/
limit_urls_to: $start_url
max_hop_count: 0
create_url_list: yes
modification_time_is_now: false ### only difference
date_factor: 100

----command executed
/usr/sbin/htdig -v -s -t -l -h0 -c htdig-u.conf>>log

-This unleashes the 'unwanted 9999 dig' on the whole yahoo site :-(
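
One way to confirm that the update dig went past the start page is to
look at the URL list written because of create_url_list. The Python
helper below is only a sketch: the name urls_beyond_start is
hypothetical, and it assumes the list holds one URL per line. It takes
the lines themselves, so the location of the file is left to the caller.

```python
def urls_beyond_start(lines, start_url):
    """Return every listed URL that is not the start URL itself.

    Hypothetical helper: 'lines' is any iterable of text lines, assumed
    to hold one URL per line. With max_hop_count: 0 the result should
    be empty.
    """
    extra = []
    for line in lines:
        url = line.strip()
        if url and url != start_url:
            extra.append(url)
    return extra


# In-memory example; in practice 'lines' would be the opened URL list file.
sample = ["http://www.yahoo.com/\n", "http://www.yahoo.com/docs/\n"]
print(urls_beyond_start(sample, "http://www.yahoo.com/"))
```

A non-empty result after the update run would show that links were
followed even though max_hop_count is 0.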

Maybe I'm missing something, though...

Gian
