Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0


From: Peter L. Peres (plp@actcom.co.il)
Date: Tue May 09 2000 - 17:26:46 PDT


>Gerard

wrt. the cast: yes I got a warning. It was the only warning I got in that
file. There were other warnings at the end of the build process, about a
structure that was defined twice or such. I haven't had the time to look
at this.

wrt: the string length check: you are perfectly right, I will fix it, it
should be <=.

wrt: the comparison algorithm: you are right, it may be better to compare
from the top of the url down. I was caught by the way the problem
appeared. See below:

wrt test case etc: I'll try to describe what happened and how I got the
idea:

A week ago I was indexing and I caught htdig looping in a directory called
/usr/doc/javadoc. The /usr/doc directory is soft-linked under
DocumentRoot, that's how htdig got there. The loop was endless; I found at
least 5 instances of that directory in the htdig log. I cut the loop by
changing permissions on the server, as I said.

Without the patch it looped, and the first URL after /usr/doc/javadoc/api
(an Apache index) was /usr/doc/javadoc/ again, which is also an Apache
index. I don't know what makes the loop happen.

With the patch, it went into /usr/doc/javadoc/api, went through it exactly
once, and that was that. There are many more places like that in the
documentation tree that I am trying to index. I can provide an ls -1
listing or something similar of that subtree if you want to see it.

So my idea came from direct observation of the facts, not from deeper
understanding 8-)

Anyway, I think that if htdig is told to index a certain part of a site,
say, http://a/b/c, then it should not climb links higher than its starting
point, even if the urls to be indexed are not specified explicitly. In
particular, if /a/b/c is a page that contains a href link to /a/d/e and
this is an Apache index, then /a/d and its children should not be indexed,
with the exception of /a/d/e. Note that /a/d/e is browsable by html
users! If this is not wanted then Apache can be configured accordingly.
This has nothing to do with htdig imho.
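
To make the rule above concrete, here is a rough sketch in Python of the
check I have in mind. It is not the patch itself and not htdig code: the
function names and the host name "host" are made up for illustration, and
it assumes the page being parsed is a directory index whose url ends in
"/". It compares path components from the top of the url down and reports
a href as a "parent dir" link only when it is a strict ancestor of the
page it was found on.

```python
from urllib.parse import urljoin, urlsplit


def path_segments(url):
    """Return the non-empty path components of a url, top level first."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]


def is_parent_dir_href(page_url, href):
    """True if href points at a strict ancestor directory of page_url.

    Hypothetical helper: assumes page_url is a directory index ending in "/".
    """
    target = urljoin(page_url, href)  # resolve relative links such as "../"
    page, tgt = urlsplit(page_url), urlsplit(target)
    # Links to another scheme or host are not "parent directory" links.
    if (page.scheme, page.netloc) != (tgt.scheme, tgt.netloc):
        return False
    page_segs = path_segments(page_url)
    tgt_segs = path_segments(target)
    # A strict ancestor has fewer components than the page ...
    if len(tgt_segs) >= len(page_segs):
        return False
    # ... and matches the page component by component, from the top down.
    return page_segs[:len(tgt_segs)] == tgt_segs


index = "http://host/a/d/e/"
for link in ["../", "/a/d/", "sub/", "/a/b/c"]:
    verdict = "prune" if is_parent_dir_href(index, link) else "follow"
    print(link, verdict)
```

With this, the "Parent Directory" link in the Apache index of /a/d/e is
dropped, while links further down, or sideways to a page that is not an
ancestor, are still followed.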

This example kind of illustrates my case. BTW, with the documentation
packages of software subsystems being attached to the packages themselves,
this kind of linking happens all the time. It is almost impossible to copy
everything under DocumentRoot. On a system like mine this makes up about
70% of the files!

thanks for the input and for the corrections,

        Peter

PS: I also have a lot of 'duplicate' files in my output. These are files
that appear under various names. I have an idea: why not md5sum each file,
and refuse to look at it again if it matches the sum and perhaps the file
size of a file already seen, regardless of the name (but only if it comes
from the same host)? I know that this might fail for a number of files. I
would like to have htdig work on such 'real life' file trees without
looping or the like. I also believe that this will speed up indexing. In
my case I expect a reduction of up to 40% in database size and time by
getting rid of duplicates.
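
A minimal sketch of the duplicate check, again in Python and only as an
illustration of the idea. Nothing here is existing htdig code: the names
content_key and is_duplicate are invented, and the document body is
assumed to be already fetched as bytes. The key combines host, size and
MD5 sum, so the same content under a different name on the same host is
recognised, while identical content on another host is not.

```python
import hashlib


def content_key(host, body):
    """Build a (host, size, md5) key for a fetched document body (bytes)."""
    return (host, len(body), hashlib.md5(body).hexdigest())


def is_duplicate(seen, host, body):
    """Return True if the same content was already seen on this host.

    'seen' is a set of keys kept by the caller; new content is recorded
    in it as a side effect.
    """
    key = content_key(host, body)
    if key in seen:
        return True
    seen.add(key)
    return False


seen = set()
page = b"<html>same javadoc page</html>"
print(is_duplicate(seen, "host", page))   # first time: not a duplicate
print(is_duplicate(seen, "host", page))   # same content, other name: duplicate
print(is_duplicate(seen, "other", page))  # other host: treated as new
```

Note that the document still has to be retrieved before it can be summed,
so the saving would be in parsing and database size rather than in
fetching.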
