Subject: Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed May 10 2000 - 09:10:19 PDT


According to Peter L. Peres:
> wrt test case etc: I'll try to describe what happened and how I got the
> idea:
>
> A week ago I was indexing and I caught htdig looping in a directory called
> /usr/doc/javadoc. The /usr/doc directory is soft-linked under
> DocumentRoot, which is how htdig got there. The loop was endless; I found
> at least 5 instances of that directory in the htdig log. I cut the loop by
> changing permissions on the server, as I said.
>
> Without the patch it looped, and the first URL after /usr/doc/javadoc/api
> (an Apache index) was /usr/doc/javadoc/ again, which is also an Apache
> index. I don't know what makes the loop happen.
>
> With the patch, it went into /usr/doc/javadoc/api, went over it exactly
> once, and that was that. There are many more places like that in the
> documentation tree that I am trying to index. I'll provide an ls -1 listing
> or something of that subtree for you to see, if you want.

OK, I've looked through your ls -lR listing and I haven't found any
links that could lead to loops yet. Of course, some of the links that
point out of the /usr/doc tree to other subtrees could lead to links that
point back into the /usr/doc tree, so I'm not getting the whole picture.

When you noticed htdig was looping, was it coming over the exact same
URLs over and over again, or was it a loop in the filesystem leading to
longer and longer paths to the same directories? E.g.:

        javadoc/api
        javadoc/api/javadoc/api
        javadoc/api/javadoc/api/javadoc/api
        javadoc/api/javadoc/api/javadoc/api/javadoc/api

It seems that if it was coming over the same URLs over and over again,
it would eventually run out of unvisited documents to index, and finally
stop. Progressively longer paths like the ones above, however, would go
on forever. You also mentioned that it seemed
to be indexing the entire filesystem. Do you have an example URL of a
file it shouldn't have gotten to, but did, and how it got there?

> PS: I also have a lot of 'duplicate' files in my output. These are files
> that appear under various names. I have an idea: why not md5sum each file,
> and refuse to look at a file again if it matches an existing checksum and
> perhaps the file size, regardless of the name (but only if it's from the
> same host)? I know that this might fail for a number of files. I would
> like to have htdig work on such 'real life' file trees without looping or
> other problems. I also believe that this would speed up indexing. In my
> case I expect a reduction of up to 40% in database size and indexing time
> by getting rid of duplicates.

Yes, duplicate files due to different paths to the same files are a
very common problem when you index trees laden with symbolic links.
For instance, the packages tree is loaded with things like:

        packages/bison/bison.changes -> ../changes/bison.changes

so all of the .changes files will be indexed twice. You can get rid of
all of those duplicates by excluding packages/changes from the indexing,
using exclude_urls or robots.txt, so that only the symlinks to individual
files in that directory will be indexed, and not the URLs directly in
that directory. Similarly, you'd want to exclude all the duplicate links
to ../../../share/texmf/doc, to avoid repeated reindexing of that directory.
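
For instance, an exclude_urls line in your htdig configuration file might
look something like this (the /doc/ prefix here is just a guess at how
/usr/doc is mapped under your DocumentRoot, so adjust the patterns to
match your actual URLs):

        exclude_urls:  /doc/packages/changes/ /doc/share/texmf/doc/

or, if you'd rather handle it on the server side, the equivalent
robots.txt entries:

        User-agent: htdig
        Disallow: /doc/packages/changes/
        Disallow: /doc/share/texmf/doc/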

MD5 checksums have been suggested before as a means of weeding out
duplicates, so if you want to work on that, we'd be glad to have your
patch.
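
In the meantime, if you just want a rough idea of how many duplicates are
hiding in the tree before deciding what to exclude, a quick shell pipeline
along these lines will list every group of files with identical contents
(just a sketch, assuming GNU md5sum is available and that /usr/doc is the
tree in question):

        # checksum every regular file under the doc tree (this can take a while)
        find /usr/doc -type f -print | xargs md5sum | sort > /tmp/doc.sums
        # print only the lines whose checksum appears more than once
        awk 'NR==FNR { count[$1]++; next } count[$1] > 1' /tmp/doc.sums /tmp/doc.sums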

The only problem with that is that the URL that ends up indexed for a
given document wouldn't always be the best or most desirable one.
For example, while indexing the packages directory, it would index the
files for packages beginning with a-c, and the .changes files under
those directories, up until it hit the packages/changes directory.
At that point, it would skip over all the files it had already seen, and
the rest of the .changes files would be indexed under the changes directory
rather than under the directories of their respective packages, leading
to inconsistent paths being used for them. Of course, things like this
could easily be tweaked by excluding certain directories to coax
htdig into favouring the paths you want.

Another option would be to use the find command to build a list of
symbolic links, pluck out the ones you want htdig to actually follow,
and put the rest into an exclude list. Depending on your setup, this
can be tedious or trivial. A lot of users have found this to be a
satisfactory approach.
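
As a starting point, something like this will produce a list of symlinks
to sort through (again just a sketch: it assumes GNU find, and the
/usr/doc to /doc/ mapping is only an example, so substitute your own
paths and URL prefix):

        # list every symbolic link in the doc tree, along with what it points to
        find /usr/doc -type l -printf '%p -> %l\n' > /tmp/doc.links
        # after deleting the links you DO want followed, turn what's left
        # into URL patterns suitable for an exclude_urls list
        awk '{ print $1 }' /tmp/doc.links | sed 's|^/usr/doc|/doc|'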

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
