Subject: Re: [htdig] Problems with GET URLS
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Tue Apr 11 2000 - 15:21:40 PDT


On Mon, 10 Apr 2000, Paul Wolstenholme wrote:

> > they'd do the job for everyone who requested duplicate suppression, though.
> > There's been talk of using MD5 checksums for this purpose. It's on the
> > to-do list, but I don't know of anyone actively working on it.
> >
> For those interested in a couple of proposed document identification
> standards, here are a couple of URLS:
>
> Digital Object Identifier (DOI)
> http://www.doi.org/
> Publisher Item Identifier (PII)
> http://www.aip.org/epub/piius.html

There's also the HTTP/1.1 entity tag header.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.11

AFAICT, Apache actually uses full MD5 checksums for ETag: headers on
static files. However, the spec itself says:

An entity tag MUST be unique across all versions of all entities
associated with a particular resource. A given entity tag value MAY be
used for entities obtained by requests on different URIs. The use of the
same entity tag value in conjunction with entities obtained by requests on
different URIs does not imply the equivalence of those entities.
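In other words, a matching ETag only means anything within a single resource. As a hypothetical sketch (Python, names my own, not ht://Dig code), the only comparison the spec actually licenses looks like:

```python
def same_entity(uri_a, etag_a, uri_b, etag_b):
    """Matching ETags imply equivalence only for the same resource
    (RFC 2616 sec. 3.11); across different URIs they prove nothing.
    Hypothetical sketch, not ht://Dig code."""
    return uri_a == uri_b and etag_a is not None and etag_a == etag_b

# Same URI, same tag: safe to treat as the same entity.
same_entity("http://a/x", '"abc"', "http://a/x", '"abc"')   # True
# Equal tags on different URIs prove nothing, so we can't dedup on them.
same_entity("http://a/x", '"abc"', "http://b/y", '"abc"')   # False
```

So ETags help with re-fetch detection on one URL, but not with cross-URL duplicate suppression.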

Rats. Obviously the fastest schemes for duplicate detection are those that
don't require significant additional computation, so picking up a header
would be a nice way of doing it. Finding a meta tag is also nice, but it's
not guaranteed: not every document will have one.
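For comparison, a content-checksum approach sidesteps the header problem entirely, at the cost of hashing every document body. A minimal sketch in Python (hypothetical, not the actual ht://Dig implementation):

```python
import hashlib

def is_duplicate(body, seen):
    """Return True if an identical body was already indexed.
    `seen` is a set of hex MD5 digests of previously fetched bodies.
    Hypothetical sketch, not ht://Dig code."""
    digest = hashlib.md5(body).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

seen = set()
is_duplicate(b"<html>same page</html>", seen)   # False: first sighting
is_duplicate(b"<html>same page</html>", seen)   # True: same bytes, possibly a different URL
```

The digest set stays small (16 bytes per document) even when the documents themselves are large, which is what makes the MD5 idea on the to-do list attractive.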

This is a good direction, however. Anyone know of other document
specifications or good ways of identifying duplicates?

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-unsubscribe@htdig.org
You will receive a message to confirm this.



This archive was generated by hypermail 2b28 : Tue Apr 11 2000 - 13:06:42 PDT