[htdig] Suse 6.2 + htdig 3.1.5

Subject: [htdig] Suse 6.2 + htdig 3.1.5
From: Peter L. Peres (plp@actcom.co.il)
Date: Tue May 02 2000 - 13:19:43 PDT


wrt: looping, and indexing only the offending directory.

It seems that I have made a logical mistake, but I think that there is a
missing feature in htdig. Apache obviously generates an absolute link to
the parent directory in any index. Since some directories which I index
are not under the html root, htdig actually tried to climb the whole
directory tree up to '/' (a couple of GB of disk)!

This is obviously not htdig's problem, it has to do with what I am
indexing and permissions.

Therefore I have a request:

Is it possible to add a feature to htdig, such that it will refuse to
climb from a given URL into its parent directory? In particular, if the
page http://here/a/b/c is to be indexed, then any URL reaped from that
page that is a parent of the page should be pruned from the list of URLs
to be indexed, immediately and totally. In my example, Apache would
report /a/b as the parent of /a/b/c. htdig must NOT follow this link. This
ought to be obvious for anything recursing through directories.

How can I do this? (I will eventually hack it into the source, later.) A
pointer to the relevant source file, or an idea, will be welcome. What is
needed is a parser that splits the target URL at its '/'s, and then
compares each reaped URL with the successive possible parents. If a reaped
URL exactly matches one of those parents, it is deleted and will not be
indexed. How does this sound?

i.e. if page /a/b/c/d is indexed, then if it contains any hrefs:
/a/b/c, /a/b or /a, they are to be ignored. However, /a/b/f should not be
ignored, nor /a/b/c/e etc.
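
The rule above can be sketched in a few lines. This is only an
illustration in Python, not htdig code; the names ancestor_paths and
is_ancestor are made up for the example, and treating '/' itself as a
parent is an assumption. Paths are compared component by component, since
a plain string-prefix test would wrongly treat /a/b as a parent of
/a/bc/d.

```python
def ancestor_paths(path):
    """Return every proper ancestor of a URL path, nearest first.

    Hypothetical helper for illustration; '/' is included as the last
    ancestor (an assumption, so that climbing to the root is also caught).
    """
    parts = [p for p in path.split("/") if p]
    # Drop the last component, then keep shortening the prefix.
    return ["/" + "/".join(parts[:i]) for i in range(len(parts) - 1, -1, -1)]


def is_ancestor(candidate, target):
    """True if candidate is a proper parent directory of target."""
    # Normalise so that '/a/b/' and '/a/b' compare equal.
    normalised = "/" + "/".join(p for p in candidate.split("/") if p)
    return normalised in ancestor_paths(target)


if __name__ == "__main__":
    for href in ["/a/b/c", "/a/b", "/a", "/a/b/f", "/a/b/c/e"]:
        print(href, "ignored" if is_ancestor(href, "/a/b/c/d") else "kept")
```

With /a/b/c/d as the page being indexed, this ignores /a/b/c, /a/b and /a,
and keeps /a/b/f and /a/b/c/e.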

Some sites will have pages that refer to their direct parents, however, so
how does one accommodate both? I think that the site will either be
indexed top-down, in which case parents own the children, and it works, or
only a part of the site is to be indexed (start URL points to a
subdirectory), and then children URLs do NOT index the parents (do not
climb). The latter case requires that ANY parent of the target URL (any
leading part of its path that ends at a '/') be pruned from the URL list
to be analyzed. imho this rule should become a permanent feature of htdig.
See above for how it should be implemented.
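
For the second case (start URL points to a subdirectory), a rough sketch
of the pruning step might look like the following. It is Python for
illustration only and says nothing about how htdig is structured
internally; climbs_above and prune_links are hypothetical names, and
leaving links to other hosts untouched is an assumption made to keep the
example small.

```python
from urllib.parse import urljoin, urlsplit


def _components(path):
    """Split a URL path into its non-empty components."""
    return [p for p in path.split("/") if p]


def climbs_above(start_url, link_url):
    """True if link_url is a parent directory of start_url on the same host."""
    start, link = urlsplit(start_url), urlsplit(link_url)
    if start.netloc != link.netloc:
        return False  # assumption: other hosts are handled by a separate rule
    s, l = _components(start.path), _components(link.path)
    # A parent has strictly fewer components, all matching the start path.
    return len(l) < len(s) and s[:len(l)] == l


def prune_links(start_url, page_url, hrefs):
    """Resolve the hrefs reaped from page_url and drop those that climb."""
    resolved = [urljoin(page_url, h) for h in hrefs]
    return [u for u in resolved if not climbs_above(start_url, u)]


if __name__ == "__main__":
    start = "http://here/a/b/c/"
    reaped = ["/a/b/", "d/", "../", "/a/b/f/", "http://here/"]
    print(prune_links(start, start, reaped))
```

Here the Apache-style parent link /a/b/, the relative ../ and the site
root are dropped, while d/ (a child) and /a/b/f/ are kept.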

The way things are now, if one were to index a page on geocities, for
example, one would index the whole of geocities, since each geocities page
contains a pointer to the master index. With my mod, one would index the
addressed page and its children, with (nearly) no fear of complications.
Of course I don't want to index geocities over a modem PPP connection! Note
the additional case of the master index on a site not referring to all of
its sub-indexes (intranet server with public + private parts!).

thanks for the patience,

