Re: [htdig] Suse 6.2 + htdig 3.1.5: looping again


Subject: Re: [htdig] Suse 6.2 + htdig 3.1.5: looping again
From: Peter L. Peres (plp@actcom.co.il)
Date: Tue May 02 2000 - 13:41:12 PDT


Hi Gilles,

please see my other posting for more data.

On Tue, 2 May 2000, Gilles Detillieux wrote:

>> <I, plp, wrote:>
>>
>> Since I index some directory trees as is, they have the parent directory
>> entry. Now, some of the directories are NOT under the HTML document tree.
>> In fact, all the looping problems occur outside the normal HTML tree, in
>> directory index land. I have verified that the original Suse HTML docs can
>> be indexed cleanly in limited time (I use this as a test case).
>>
>> So there is a bug in there, but where? What makes this part of the tree
>> different from all others? Apache has fancy indexing turned on.
>
>What do you consider to be the "normal HTML tree"? Are you referring to
>a certain subset of your whole web site, which is all you want to index?

Yes.

>If so, you probably need to make sure your limit_urls_to attribute is
>set to limit indexing to that sub-tree. If you mean the parent directory
>entries of some pages actually lead htdig right off the server's DocumentRoot
>directory and into directories that are not supposed to be visible from
>a web browser, that really shouldn't be happening at all, unless you have
>seriously misconfigured your server. The parent directory links may lead
>back up to the DocumentRoot, but that should be it, so if you're indexing
>the whole HTML document tree from the DocumentRoot down, these links should
>not lead anywhere htdig hasn't already visited.

Aha. But they do, because I have added the documentation sections of some
packages and programs to the HTML system, mainly by making symbolic links
to them from under DocumentRoot. Those packages are not under DocumentRoot
at all. This cannot be helped: I do not have the disk space to duplicate
everything, and some stuff is leaf-mounted or temporarily mounted (cdroms).
But there is an easy way to prevent htdig from seeing the parent links in
ANY directory, and that would be a feature imho. Please see my other
posting.
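
For illustration only, here is a minimal sketch of the kind of parent-link
filter meant above. It is written in Python, not taken from htdig, and the
names is_parent_link and drop_parent_links, as well as the example URL, are
hypothetical. It assumes directory listings are served at URLs ending in "/"
and treats any link that resolves to an ancestor directory on the same host
as a parent link.

```python
from urllib.parse import urljoin, urlsplit


def is_parent_link(page_url: str, href: str) -> bool:
    """Return True if href resolves to a strict ancestor directory of page_url."""
    page = urlsplit(page_url)
    target = urlsplit(urljoin(page_url, href))

    # Links to another scheme or host are not parent links in this sense.
    if (page.scheme, page.netloc) != (target.scheme, target.netloc):
        return False

    # Directory that contains the current page, always ending in "/".
    page_dir = page.path[: page.path.rfind("/") + 1] or "/"
    target_path = target.path or "/"

    # An ancestor is a different directory path that prefixes page_dir.
    return (
        target_path.endswith("/")
        and target_path != page_dir
        and page_dir.startswith(target_path)
    )


def drop_parent_links(page_url, hrefs):
    """Keep only the links that do not climb above the current directory."""
    return [h for h in hrefs if not is_parent_link(page_url, h)]


if __name__ == "__main__":
    listing = "http://localhost/docs/pkg/"  # hypothetical directory index URL
    print(drop_parent_links(listing, ["../", "/docs/", "README", "sub/"]))
```

A filter like this never follows a link upward, so a symlinked tree can only
be entered from the top, through the link under DocumentRoot.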

I understand that htdig was never meant to be used like this, but why not?
If you are thinking of the parent directory links that allow the filesystem
to be browsed, that's ok. Users can do that anyway from a shell if they want
to (on this machine). Files that really should not be seen are protected
by their permissions, so the server won't serve them, and the users won't
be able to see them from their shells either, for the same reason.

I really need htdig's capabilities for this, and I'd like to see other
documentation, source files, etc. directly in a browser on the LAN.

>what the problem might be in your case, so a redesign of the retriever seems
>a tad premature.

Not redesign, a small feature addition ;-).

>Well, if all else fails, perhaps posting some concrete data will help.
>Taking shots in the dark can be surprisingly effective as long as you
>have good guesses, but we seem to be out of those now, so I think a more
>analytical approach is called for.
>
>A word of caution, though: your mailer seems to be dropping characters,
>as I can see spaces missing from the text you quoted from my previous
>message. You'll want to make sure any log extracts or other data you
>post to the list doesn't get similarly mangled.

This is strange. *I* do sometimes drop characters when typing 'fast' but
this is the first time I have seen missing characters in a quotation. I will
run some trials and see what happens. I've been using this mailer for
quite some time now (pine on linux). Please do let me know if anything
else strange shows up wrt this.

thanks,

        Peter
