[htdig3-dev] Re: Patch to URL.cc


Subject: [htdig3-dev] Re: Patch to URL.cc
From: Warren Jones (wjones@tc.fluke.com)
Date: Sat Jan 15 2000 - 11:16:27 PST


On Fri, Jan 14, 2000 at 11:27:16AM -0600, Gilles Detillieux wrote:

> ... Warren also made a patch to URL.cc, for which he invited
> discussion. I'm not wild about it myself, but others may differ.
> In any case, it probably shouldn't go in given the feature freeze,
> but his fix to Retriever.cc looks OK to me.

I'm not at all sure that the patch to URL.cc is the best solution,
but something like it is essential for our site, and I suspect
others are in the same situation. Here are the details:

    o We must index only URLs whose extensions are listed in
      valid_extensions, since we have no control over what
      individual users put in their web directories, and some
      are ...uhm... indiscriminate.

    o If a user puts a binary executable in his web directory,
      our server reports its type as "text/html".
      I don't have control over this either.

    o Using valid_extensions also allows URLs with no extension
      (after my patch to Retriever.cc). This is as it should be,
      since many URLs with no extension are subdirectories,
      which we need to index. But other URLs with no extension
      are binary executables or heaven knows what.

    o Users can't be relied on to use a trailing slash in links
      that point to a directory, e.g. <A HREF="subdirectory/">.
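
To make the filtering rule in the list above concrete, here is a minimal
sketch. It is not the code from Retriever.cc; the names extensionOf and
passesExtensionFilter are hypothetical, and the extension list is assumed
to be a plain list of strings that include the leading dot.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: return the extension (including the dot) of the
// last path segment, or an empty string if that segment contains no dot.
std::string extensionOf(const std::string& path) {
    const std::size_t slash = path.find_last_of('/');
    const std::size_t start = (slash == std::string::npos) ? 0 : slash + 1;
    const std::size_t dot = path.find_last_of('.');
    // A dot that occurs before the last slash belongs to a directory name.
    if (dot == std::string::npos || dot < start) return "";
    return path.substr(dot);
}

// Hypothetical filter: accept a path if its extension is in the list,
// or if it has no extension at all (it might be a subdirectory).
bool passesExtensionFilter(const std::string& path,
                           const std::vector<std::string>& valid) {
    const std::string ext = extensionOf(path);
    if (ext.empty()) return true;  // no extension: let it through
    return std::find(valid.begin(), valid.end(), ext) != valid.end();
}
```

With a list of ".html" and ".htm", this accepts "/~user/index.html" and
"/~user/subdirectory" but rejects "/~user/prog.exe". It also accepts an
extensionless binary such as "/~user/prog", which is exactly the problem.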

In short, I see no way to tell whether a URL with no extension
is 1) a subdirectory, which we want to index or 2) binary garbage,
which we want to ignore, except to do what I've done in URL.cc:
add a trailing slash to the URL and try to retrieve it.
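
Roughly, the idea looks like the sketch below. This is an illustration
and not the actual patch to URL.cc: the function name directoryUrlOrEmpty
and the Probe callback are hypothetical, and the sketch assumes that a
successful retrieval of the slash-terminated URL means the original URL
was a directory.

```cpp
#include <functional>
#include <string>

// Hypothetical callback: returns true if the given URL can be retrieved.
// A real crawler would issue an HTTP request here.
using Probe = std::function<bool(const std::string&)>;

// For a URL with no extension, append a trailing slash (if missing) and
// try to retrieve the result. Returns the slash-terminated URL when the
// retrieval succeeds, or an empty string when the URL should be ignored.
std::string directoryUrlOrEmpty(const std::string& url,
                                const Probe& retrievable) {
    std::string candidate = url;
    if (candidate.empty() || candidate.back() != '/') candidate += '/';
    return retrievable(candidate) ? candidate : std::string();
}
```

The cost of this approach is one extra retrieval attempt for every
extensionless URL that turns out not to be a directory.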

Still, I agree with Gilles in being a little uncomfortable with
this solution. I'd be happy if someone could suggest something
that's more elegant.

-- 
Warren
