Re: [htdig] Suse 6.2 + htdig 3.1.5


Subject: Re: [htdig] Suse 6.2 + htdig 3.1.5
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue May 02 2000 - 13:56:11 PDT


According to Peter L. Peres:
> wrt: looping, and indexing only the offending directory.
>
> It seems that I have made a logical mistake, but I think that there is a
> missing feature in htdig. Apache obviously generates an absolute link to
> the parent directory in any index. Since some directories which I index
> are not under the html root, the htdig actually tried to climb the whole
> directory tree up to '/' (a couple of GBs of disks) !!!

I'm having trouble envisioning how this could happen. If you're looking
at a directory index for a sub-directory of the DocumentRoot, then the
parent directory link should not point to something outside of the
DocumentRoot. If the DocumentRoot itself has no index.html, I don't know
what Apache would give for the parent directory, but even if that did
point outside of the DocumentRoot, Apache should never serve a document
that's out of bounds. If it does, this seems to suggest a serious
misconfiguration of Apache, not to mention a potentially serious security
hole.

> Is it possible to add a feature to htdig, such that it will refuse to
> climb the directory into the parent of a given URL ? In particular, if the
> page http://here/a/b/c is to be indexed, then any URL reaped from that
> page, that is a parent of the page, should be pruned from the list of URLs
> to be indexed immediately and totally. Like in my example, Apache would
> report /a/b as the parent of /a/b/c. htdig must NOT follow this link. This
> ought to be obvious for anything recursing through directories.

More often than not on web sites, the top-down hierarchy of href's
does not match the top-down hierarchy of directories on a server,
so your proposal could put legitimately linked documents out of reach.

If you want to limit indexing to a particular sub-tree of your web site,
you can already do this (e.g. "limit_urls_to: http://here/a/b/"). If you
want to index an entire site, you can use "limit_urls_to: http://here/"
to do that. If Apache is serving documents outside of the DocumentRoot,
what do those URLs look like?

> The way things are now, if one would index a page on geocities, f.ex., one
> would index the whole geocities, since each geocities page contains a
> pointer to the master index.

No, this is an example of the sub-tree case I mentioned above. As long
as you set limit_urls_to correctly, it will reject URLs outside of the
sub-tree you want. This is essentially the same way you avoid indexing
the whole world-wide-web when you index a site with external links.
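
To make the filtering concrete, here is a minimal Python sketch of the
idea. It is not htdig's actual code. It assumes limit_urls_to acts as a
case-insensitive substring match on each candidate URL; the function
name within_limits and the sample URLs are made up for illustration.

```python
def within_limits(url, limit_patterns):
    """Return True if url matches at least one limit pattern.

    Assumption: a case-insensitive substring match, which is how
    limit_urls_to is understood to behave here. Not htdig's own code.
    """
    lowered = url.lower()
    return any(pattern.lower() in lowered for pattern in limit_patterns)


# Hypothetical links reaped from a page under http://here/a/b/c/
found = [
    "http://here/a/b/c/page.html",   # inside the sub-tree
    "http://here/a/b/",              # parent directory, still inside
    "http://here/a/",                # above the sub-tree
    "http://elsewhere/index.html",   # external link
]

limits = ["http://here/a/b/"]
kept = [u for u in found if within_limits(u, limits)]
print(kept)
```

Under that assumption, the links above the sub-tree and the external
link are dropped, while anything under http://here/a/b/ stays eligible
no matter which page linked to it.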

> Note the
> additional case of the master index on a site not referring to all of its
> sub-indexes (intranet server with public + private parts !).

I'm not sure I see the relevance here. If you want to keep some documents
hidden from htdig (or any indexing spider), then you should avoid any href's
pointing to them from the publicly indexed pages on your site. You can also
make use of exclude_urls, or your robots.txt file, to disallow indexing of
any subdirectories you want left out.
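
As a rough sketch of how those two mechanisms combine, the Python below
rejects a URL if it contains any exclude pattern or if robots.txt
disallows it. This is an illustration only, not htdig's implementation.
It assumes exclude_urls is a plain substring match; the helper name
is_excluded, the patterns, the robots.txt rules and the URLs are all
hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_excluded(url, exclude_patterns):
    """True if url contains any exclude pattern (assumed substring match)."""
    return any(pattern in url for pattern in exclude_patterns)


# Hypothetical robots.txt rules, parsed from lines instead of fetched.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Hypothetical exclude_urls patterns.
exclude = ["/cgi-bin/", "/internal/"]

candidates = [
    "http://here/public/index.html",
    "http://here/private/report.html",   # disallowed by robots.txt
    "http://here/cgi-bin/search",        # matches an exclude pattern
    "http://here/internal/notes.html",   # matches an exclude pattern
]

allowed = [
    url for url in candidates
    if not is_excluded(url, exclude) and robots.can_fetch("htdig", url)
]
print(allowed)
```

Either check alone is enough to keep a directory out of the index, even
if a public page links to it.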

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



