Re: [htdig] modification_time_is_now again


Subject: Re: [htdig] modification_time_is_now again
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Dec 01 1999 - 08:57:03 PST


According to Giancarlo Pinerolo:
> From the doc:
> > This sets ht://Dig's response to a server that does not return a modification date. By default, it stores nothing. By setting
> > modification_time_is_now, it will store the current time if the server does not return a date. Though this will return
> > incorrect dates in search results, it will cut down on reindexing from such servers when doing updates.
>
> I don't understand why it says 'it will cut down on reindexing from such
> servers when doing updates'.

As was pointed out, this is pretty inaccurate. Technically, it may cut
down on reindexing from servers that don't return Last-Modified headers,
but do honour If-Modified-Since headers from the client. In this case,
the server will give a 304 status code to htdig for any document not
modified since the last time it was indexed, so htdig won't reindex it.
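
To make the 304 case concrete, here is a minimal Python sketch of that
exchange from the client's side. It is not htdig's code (htdig is written
in C++); the names conditional_headers and should_reindex are made up for
illustration, and only the 200 and 304 status codes are modelled.

```python
from email.utils import formatdate


def conditional_headers(stored_modtime: int) -> dict:
    """Hypothetical helper: build the conditional-GET headers for one document.

    stored_modtime is the Unix time kept in the database; 0 means "unknown".
    """
    if stored_modtime <= 0:
        # Nothing stored, so there is nothing to make the request conditional on.
        return {}
    # HTTP dates are expressed in GMT, e.g. "Tue, 30 Nov 1999 00:00:00 GMT".
    return {"If-Modified-Since": formatdate(stored_modtime, usegmt=True)}


def should_reindex(status_code: int) -> bool:
    """Assumed policy: a 304 means the stored copy is current, a 200 means reparse."""
    return status_code == 200


if __name__ == "__main__":
    print(conditional_headers(943920000))
    print(should_reindex(304))
```

With modification_time_is_now set, a stored time always exists, so the
conditional header can always be sent, which is the only reason the option
could save any reindexing at all.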

In practice, such servers are probably pretty rare, though I believe the
contributor of this option did indeed have such a server.

> EG
> 1) a doc has 'last modified' unknown (which, as I recall from a previous
> post, means actually 0)
> 2) this, on the first run, gets changed to now (let's say 30/11/1999
> 00.00)
> 3) the next run the same doc will return 0 again
> 4) then what happens? will it

Technically, the server doesn't return 0, but rather, htdig zeroes out the
modtime if modification_time_is_now is false and the server didn't return
a Last-Modified header, as an indication that the actual modtime is unknown.
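
As a rough illustration of that rule, the following Python sketch picks
the modtime to store for a document. The function modtime_to_store and its
parameters are hypothetical and only mirror the behaviour described in
this message, not the actual htdig source.

```python
import time
from typing import Optional


def modtime_to_store(last_modified: Optional[int],
                     modification_time_is_now: bool,
                     now: Optional[int] = None) -> int:
    """Return the modtime to record for a document (Unix time, 0 = unknown).

    last_modified is the parsed Last-Modified header, or None if the server
    did not send one. `now` can be passed explicitly to keep runs repeatable.
    """
    if last_modified is not None:
        # The server told us, so its value always wins.
        return last_modified
    if modification_time_is_now:
        # No header: fall back to the time of indexing.
        return int(time.time()) if now is None else now
    # No header and the option is off: mark the modtime as unknown.
    return 0


if __name__ == "__main__":
    print(modtime_to_store(None, False))             # unknown
    print(modtime_to_store(None, True, now=943920000))
```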

> a) compare 0 to 30/11 and decide that it has not been changed?
>
> or
>
> b) transform 0 to now again (let's say 01/12/1999) and reindex it?
>
> From that phrase in the doc I guess the first, isn't it?

Neither, really. If the modtime is unset (0), htdig won't send the
server an If-Modified-Since header, so the server will return the
document unconditionally. What happens then is, in my opinion, a bug -
the Retriever will compare the old and new modtimes, even though they're
both 0, and assume the document is "retrieved but not changed", so it
won't reindex even though it's already read the entire document. If the
times are equal but 0, it really ought to reindex. Not a problem if you
set modification_time_is_now to true, though.
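
The comparison I'm describing, and the guard I think it ought to have,
can be sketched in a few lines of Python. Both function names are invented
for this example; the first only restates the behaviour described above,
and the second is a suggestion rather than anything in the htdig code.

```python
def unchanged_as_described(old_modtime: int, new_modtime: int) -> bool:
    """Behaviour described above: equal modtimes mean "retrieved but not changed"."""
    return old_modtime == new_modtime


def unchanged_with_guard(old_modtime: int, new_modtime: int) -> bool:
    """Suggested variant: two unknown (0) modtimes prove nothing, so reindex."""
    return old_modtime != 0 and old_modtime == new_modtime


if __name__ == "__main__":
    # Both modtimes unknown: the plain comparison skips the document,
    # the guarded one lets it be reindexed.
    print(unchanged_as_described(0, 0))
    print(unchanged_with_guard(0, 0))
```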

> Then I really think I got a bug when running an update with mod_t_is_now
> false over a base db that has been dug with m_t_i_n true:

No, I don't think so. If you originally indexed with
modification_time_is_now set to true, the document times will all be set,
either to what the server reported as the Last-Modified time, or in its
absence, the time of indexing.

When you update with modification_time_is_now set to false, the modtimes
won't match, so htdig should reindex everything, unless the server
honours the If-Modified-Since header. But then the reindexed documents
will have an unset (0) modtime, so if you update yet again with the
modification_time_is_now attribute set to false, htdig will not reindex
the documents, as I described above.
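
That sequence of digs can be traced with a small Python simulation. It
assumes a server that never sends Last-Modified and ignores
If-Modified-Since, and it models only the modtime comparison described in
this message; update_run and simulate are hypothetical names, and the
timestamps are arbitrary.

```python
from typing import List, Optional, Tuple


def update_run(old_modtime: Optional[int],
               modification_time_is_now: bool,
               now: int) -> Tuple[bool, int]:
    """One dig of one document; returns (reindexed, modtime stored afterwards).

    old_modtime is None when the document has never been indexed.
    """
    # The assumed server sends no Last-Modified header, so the new modtime
    # is either the time of indexing or 0 (unknown).
    new_modtime = now if modification_time_is_now else 0
    # Equal modtimes are treated as "retrieved but not changed", even when both are 0.
    reindexed = old_modtime is None or new_modtime != old_modtime
    return reindexed, new_modtime


def simulate(settings: List[bool], start: int = 943920000,
             step: int = 86400) -> List[bool]:
    """Run successive digs one day apart; settings[i] is the option for dig i."""
    modtime: Optional[int] = None
    results = []
    for i, flag in enumerate(settings):
        reindexed, modtime = update_run(modtime, flag, start + i * step)
        results.append(reindexed)
    return results


if __name__ == "__main__":
    # Initial dig with the option on, then two updates with it off.
    print(simulate([True, False, False]))  # [True, True, False]
```

The third dig is the one that skips the document, because both the stored
and the new modtime are 0 by then.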

> in this case the max_hop count is completely unrespected and a 9999 dig
> starts.
>
> If this bug is true (in which case I bet you'd immediately halt the
> unwanted 9999 dig, and restart it with m_t_i_n true) then any doc that
> doesn't return a mod_t will never have a chance to be reindexed again.

You lost me here. What's a 9999 dig? If I recall, there were still
some unresolved problems with hop counts during an update dig, but
as far as I can tell this is totally unrelated to modtime handling.
The modtime may affect whether a document is reindexed or not, but has
no effect on interpretation of hop counts.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
