Re: [htdig] Suse 6.2 + htdig 3.1.5: looping again

Subject: Re: [htdig] Suse 6.2 + htdig 3.1.5: looping again
From: Gilles Detillieux
Date: Tue May 02 2000 - 14:35:10 PDT

According to Peter L. Peres:
> On Tue, 2 May 2000, Gilles Detillieux wrote:
> >What do you consider to be the "normal HTML tree"? Are you referring to
> >a certain subsetof your whole web site, which is all you want to index?
                   ^ (dropped space)
> Yes.
> >If so, you probably need to make sure your limit_urls_to attribute is
> >set to limit indexing to that sub-tree. If you mean the parent directory
> >entries of some pages actually lead htdig right offthe server's DocumentRoot
                                                      ^ (dropped space)
> >directory and into directories that are not supposed to be visible from
> >a web browser, that really shouldn't be happening at all, unless you have
> >seriously misconfigured your server. The parent directory links may lead
> >backup to the DocumentRoot, but that should be it, so if you're indexing
       ^ (dropped space)
> >the whole HTML document tree from the DocumentRoot down, these links should
> >not lead anywhere htdig hasn't already visited.
> Aha. But they do, because I have added the documentation sections of some
> packages and programs to the HTML system, mainly by making symbolic links
> to them from under DocumentRoot. Those packages are not at all under
> DocumentRoot. This cannot be helped, as I do not have the disk space to
> duplicate everything, and some stuff is leaf-mounted or temporarily
> mounted (cdroms), but there is an easy way to prevent htdig from seeing
> the parent-links in ANY directory, and that would be a feature imho.
> Please see my other posting.

If I recall correctly, this is the first mention you make of symbolic links
in your directories. This is what I meant about us taking shots in the dark.
When extremely relevant bits of information like that go unsaid for a week,
it makes for a frustrating week of guessing at other possible problems.

The fact of the matter is that if you add a symbolic link under your
DocumentRoot to any directory at all, whether that directory is already
somewhere else under the DocumentRoot, or somewhere else altogether, that
whole sub-tree is virtually duplicated under your document root. Apache
doesn't make a distinction between a physical directory and a symbolic
link to one. That's why you have to be very careful about symbolic links
on your web site. A single symbolic link to / under your DocumentRoot would
expose the server's whole file system to the general public.
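
One way to audit this is to list every symbolic link under the document
root and note where each one really points. The following is only a
minimal sketch, not part of htdig or Apache: the function name
find_symlinks is made up for illustration, and /home/httpd/html is simply
the document root that appears in the listing in this message.

```python
import os


def find_symlinks(doc_root):
    """Yield (link_path, resolved_target, outside_root) for each symlink.

    outside_root is True when the link resolves to a location that is
    not under doc_root. Linked directories are reported, not descended.
    """
    root = os.path.realpath(doc_root)
    # os.walk does not follow symlinked directories by default, so a link
    # that loops back into the tree cannot cause endless recursion here.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.path.realpath(path)
                outside = os.path.commonpath([root, target]) != root
                yield path, target, outside


if __name__ == "__main__":
    # Assumed document root; change it to match your server.
    for link, target, outside in find_symlinks("/home/httpd/html"):
        note = "OUTSIDE document root" if outside else "inside document root"
        print(f"{link} -> {target} ({note})")
```

Links reported as inside the document root still matter for indexing,
since the same files then become reachable under two different URLs.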

Having said that, there are a couple of ways that already exist in htdig to
prevent it from indexing a given subdirectory (whether real or a symbolic
link). Those are exclude_urls and robots.txt. E.g., on my site, I have
the following link:

lrwxrwxrwx 1 root root 31 Feb 17 16:30 /home/httpd/html/htdig/htdoc -> ../../../../usr/doc/htdig-3.1.5

which I wanted to make publicly viewable, but which I avoid indexing by using
"Disallow: /htdig/htdoc" in my robots.txt file.
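
To check which paths a rule like this covers before running a full dig,
it can be tested with Python's standard robots.txt parser. This is a
sketch under assumptions: the "User-agent: *" line and the helper name
is_allowed are illustrative additions (only the Disallow line comes from
this message), and htdig's own robots.txt handling may differ from
Python's parser in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Only the Disallow line is taken from the message; the User-agent
# line is an assumption so that the rule applies to every robot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /htdig/htdoc
"""


def is_allowed(path, user_agent="htdig", robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/htdig/htdoc/index.html", "/htdig/index.html"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

The rule is a path prefix, so everything below /htdig/htdoc is blocked
while the rest of /htdig stays open to indexing.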

> I understand that htdig was never meant to be used like this, but why not?

It's meant to be used many ways, which is why you have to be careful to tell
it how YOU want it to be used.

> If you think about the parent directories that allow the filesystem to
> be browsed, that's ok. Users can do that anyway from a shell if they want
> to (on this machine). Files that should really not be seen, are protected
> by their permissions, and the server won't serve them, and the users won't
> be able to see them from their shells either, for the same reason.

The important distinction is that there may be files that you allow anyone
to see provided they're authorized to access your system (i.e. have a login
ID), but that you don't want anyone in the world to be able to see without

> I really need htdig's capabilities for this, and I'd like to see other
> documentations and source files etc, directly in a browser on the LAN.
> >what the problem might be in your case, so a redesign of the retriever seems
> >a tad premature.
> Not redesign, a small feature addition ;-).

I hope I don't sound arrogant by saying this, but I've found that many
newbie suggestions for new features are things that can already be done
with the software once they get familiar with the existing features.
Adding more features just makes it harder for future newbies to get
familiar with the existing features, leading to more requests for
redundant features.

> >A word of caution, though: your mailer seems to be dropping characters,
> >as I can see spaces missing from the text you quoted from my previous
> >message. You'll want to make sure any log extracts or other data you
> >post to the list doesn't get similarly mangled.
> This is strange. *I* do sometimes drop characters when typing 'fast' but
> this is the first time I see missing characters in a quotation. I will
> make some trials and see what happens. I've been using this mailer for
> quite some time now (pine on linux). Please do let me know if anything
> more strange shows up wrt this.

I've indicated 3 more of these above. So far, it just seems to be in
quoted text, so perhaps attachments will be OK.

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


This archive was generated by hypermail 2b28 : Tue May 02 2000 - 12:21:57 PDT