Re: [htdig3-dev] HEAD before GET to allow exclusion by MIME content-type


Subject: Re: [htdig3-dev] HEAD before GET to allow exclusion by MIME content-type
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Feb 02 2000 - 07:14:17 PST


According to Simon Pickup:
> We use htdig to index a remote site over a WAN, but that site serves a
> number of large binary files, and we want to save bandwidth by
> preventing these from being downloaded (as they cannot be indexed
> anyway).
>
> We are currently using "bad_extensions" to skip them by URL matching,
> but it would be more reliable if we could skip them based on the MIME
> type returned in the HTTP header. Of course to do this would mean
> sending a HEAD request first, and then only a GET if the content-type is
> one we can index. Of course this has a significant latency impact, but
> we can live with that.
>
> I was considering something like "valid_content_types" and
> "bad_content_types" analogous to "valid_extensions" and
> "bad_extensions". They would default to empty, resulting in the current
> GET-only behaviour; if either is non-empty, the behaviour would be
> HEAD-then-GET.

Currently, htdig does indeed check the Content-Type header in the response
to the GET request, and only follows through on the download if it's a
type it recognises - that would be text/* or application/pdf, or any
type for which an external parser is defined. If the content-type is
something else, htdig closes the connection before fetching the document,
thus aborting the download.
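
As a rough illustration of the type check described above (not htdig's
actual code, which is C++), the decision could be sketched in Python as
follows. The function name is_parsable and the external_parser_types
argument are hypothetical, introduced only for this example.

```python
def is_parsable(content_type, external_parser_types=()):
    """Return True if a Content-Type value names a type worth downloading.

    Mirrors the rule described in the text: text/*, application/pdf, or any
    type for which an external parser has been configured.
    """
    # Drop parameters such as "; charset=iso-8859-1" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime.startswith("text/") or mime == "application/pdf":
        return True
    # external_parser_types is a hypothetical list of extra MIME types.
    return mime in {t.lower() for t in external_parser_types}
```

A crawler following this rule would read the response headers, call the
check, and close the connection without reading the body when it returns
False.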

In 3.2, htdig will support persistent connections and HEAD before GET as
options. I haven't taken the time to get familiar with the new code, but
it does seem to detect and reject documents that are "not parsable".
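
For readers unfamiliar with the idea, HEAD before GET over one reused
connection can be sketched roughly as below. This is a minimal Python
sketch under stated assumptions, not the 3.2 implementation: the names
head_then_get and default_accept are hypothetical, and conn is assumed to
behave like a standard http.client.HTTPConnection.

```python
def default_accept(content_type):
    """Hypothetical filter: accept text/* and application/pdf only."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith("text/") or mime == "application/pdf"


def head_then_get(conn, path, accept=default_accept):
    """Fetch path only if a preliminary HEAD reports an acceptable type.

    conn is an already constructed connection object that the caller keeps
    and reuses across documents. Returns the body as bytes, or None when
    the HEAD response rules the document out.
    """
    conn.request("HEAD", path)
    head = conn.getresponse()
    head.read()  # a HEAD response has no body; this completes the exchange
    if head.status != 200 or not accept(head.getheader("Content-Type", "")):
        return None  # skipped: no GET is sent, so no body is transferred

    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()
    return body if resp.status == 200 else None
```

Typical use would be to create one connection per server, for example
conn = http.client.HTTPConnection("www.example.com"), and call
head_then_get(conn, path) for each document on that server. The extra
round trip per document is the latency cost mentioned in the original
question; reusing the connection avoids paying for a new TCP setup as well.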

> Has anybody considered this before? Any thoughts?
>
> Presumably it would not be too difficult to implement, and I'd consider
> writing a patch myself.

Please have a look at the 3.2.0b1 pre-release that'll be out in a day or
two, and see if it does what you want. If it doesn't, we'd welcome any
bug reports or fixes.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



