Subject: Re: [htdig3-dev] HEAD before GET to allow exclusion by MIME content-type
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Wed Feb 02 2000 - 07:14:17 PST
According to Simon Pickup:
> We use htdig to index a remote site over a WAN, but that site serves a
> number of large binary files, and we want to save bandwidth by
> preventing these from being downloaded (as they cannot be indexed).
> We are currently using "bad_extensions" to skip them by URL matching,
> but it would be more reliable if we could skip them based on the MIME
> type returned in the HTTP header. Of course to do this would mean
> sending a HEAD request first, and then only a GET if the content-type is
> one we can index. Of course this has a significant latency impact, but
> we can live with that.
> I was considering something like "valid_content_types" and
> "bad_content_types" analogous to "valid_extensions" and
> "bad_extensions". They would default to empty, resulting in the current
> GET-only behaviour; if either is non-empty, the behaviour would be
Currently, htdig does indeed check the Content-Type header in the response
to the GET request, and only follows through on the download if it's a
type it recognises - that would be text/* or application/pdf, or any
type for which an external parser is defined. If the content-type is
something else, htdig closes the connection before fetching the document,
thus aborting the download.
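To make that check concrete, here is a minimal Python sketch of the idea, not htdig's actual C++ code. The names is_parsable, fetch_if_parsable and external_parser_types are hypothetical, and the rule is only the one described above: text/*, application/pdf, or a type with an external parser.

```python
import http.client


def is_parsable(content_type, external_parser_types=()):
    """Return True if the Content-Type is one the indexer would download."""
    # Drop parameters such as "; charset=iso-8859-1" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime.startswith("text/") or mime == "application/pdf":
        return True
    # Hypothetical stand-in for "any type for which an external parser is defined".
    return mime in {t.lower() for t in external_parser_types}


def fetch_if_parsable(host, path, external_parser_types=()):
    """Issue a GET, but read the body only if the Content-Type is acceptable."""
    conn = http.client.HTTPConnection(host, timeout=30)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()  # reads the status line and headers only
        ctype = resp.getheader("Content-Type", "")
        if not is_parsable(ctype, external_parser_types):
            return None  # body is never read; the connection is closed below
        return resp.read()
    finally:
        conn.close()
```

Note that some of the body may already be in flight when the connection is closed, so this saves most, but not necessarily all, of the transfer.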
In 3.2, htdig will support persistent connections and HEAD before GET as
options. I haven't taken the time to get familiar with the new code, but
it does seem to detect and reject documents that are "not parsable".
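For discussion, the HEAD-before-GET flow proposed in this thread could look roughly like the Python sketch below. This is not the 3.2 code, which I have not studied. The valid_types and bad_types parameters stand in for the suggested "valid_content_types" and "bad_content_types" attributes, and their semantics (bad list rejects, non-empty valid list restricts) are an assumption by analogy with the extension attributes.

```python
import http.client


def content_type_allowed(content_type, valid_types=(), bad_types=()):
    """Apply the hypothetical valid/bad content-type lists to one header value."""
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in {t.lower() for t in bad_types}:
        return False
    # Assumption: a non-empty valid list means only those types are accepted.
    if valid_types and mime not in {t.lower() for t in valid_types}:
        return False
    return True


def fetch(host, path, valid_types=(), bad_types=()):
    """GET a document, preceded by a HEAD only when a type list is configured."""
    conn = http.client.HTTPConnection(host, timeout=30)
    try:
        if valid_types or bad_types:
            conn.request("HEAD", path)
            head = conn.getresponse()
            head.read()  # a HEAD response has no body; this completes the exchange
            ctype = head.getheader("Content-Type", "")
            if not content_type_allowed(ctype, valid_types, bad_types):
                return None  # skipped without ever requesting the body
        conn.request("GET", path)
        return conn.getresponse().read()
    finally:
        conn.close()
```

With both lists empty, only the GET is sent, matching the current GET-only default. The extra round trip for the HEAD is the latency cost mentioned above, which a persistent connection would reduce.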
> Has anybody considered this before? Any thoughts?
> Presumably it would not be too difficult to implement, and I'd consider
> writing a patch myself.
Please have a look at the 3.2.0b1 pre-release that'll be out in a day or
two, and see if it does what you want. If it doesn't, we'd welcome any
bug reports or fixes.
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930