Subject: Re: [htdig] Suse 6.2 + htdig 3.1.5: looping again
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue May 02 2000 - 13:01:28 PDT


According to Peter L. Peres:
> On Mon, 1 May 2000, Gilles Detillieux wrote:
> >It's been said time and time again on this list, but I'll repeat it.
> >htdig DOES keep track of visited URLs, and does NOT re-index any page
> >with a unique URL more than once per indexing run. Have a look at the
>
> OK, so I see double, and 'less' does too, since I've been using it to look
> at the log (search). Look, I'm new to htdig, so please bear with me here.

OK, but please bear with us too, and be willing to accept what people
who are less new to htdig have to say. It turns out that I was correcting
you on a point on which Geoff had already corrected you a week ago.
(See note 2 in http://www.htdig.org/mail/2000/04/0265.html.)  If you're
seeing duplicates, it would be far more informative to show us an actual
extract of your logs, and to explain how they were obtained (e.g. which
program and what verbosity level), rather than to offer vague
interpretations of unspecified logs.
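
For example, something along these lines (the config file path here is
just an illustration; use wherever your SuSE package put it):

    htdig -i -vvv -c /etc/htdig/htdig.conf > /tmp/htdig.log 2>&1
    sort /tmp/htdig.log | uniq -d | less

With the extra -v's, htdig should report each URL as it works on it, so
a URL that really was retrieved twice ought to show up as a repeated
line in the second command's output.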

> While it is always possible that I made a mistake somewhere, it is not so
> likely anymore after 6 runs or so. Maybe I am exceeding the tiny machine's
> limits where I run this and some obscure libc or db bug is causing this.

Well, we haven't ruled out any obscure bug at this point, but we don't
exactly have any data with which to narrow down where a bug may exist.
If the problem lies in your configuration file, and you're using the same
configuration file for each run, it wouldn't much matter if you ran it 6
times or 60 times.

> >- Improper links to SSI documents, causing a buildup of extra path
> >information on the URL.
> >- A similar buildup of ignored extra path information, or extra URL
> >parameters to a CGI script.
> >- A CGI script that generates an infinite virtual tree of URLs through
> >links to itself.
> >- Many symbolic links to documents, and hypertext links to documents
> >through some of these symbolic links, causing many different virtual
> >trees of the same set of documents.
> >- Mixed case references to documents on a case-insensitive webserver,
> >causing many different virtual trees if case_sensitive is not set to
> >false.
>
> The web server is case sensitive, the SSI includes do not cause new URLs
> to be generated, all cgi scripts are concentrated in one place and that
> was pruned. I forgot about canonicalization and '..', but that brings
> up a SERIOUS problem imho:
>
> Since I index some directory trees as is, they have the parent directory
> entry. Now, some of the directories are NOT under the HTML document tree.
> In fact, all the looping problems occur outside the normal HTML tree, in
> directory index land. I have verified that the original Suse HTML docs can
> be indexed cleanly in limited time (I use this as a test case).
>
> So there is a bug in there, but where? What makes this part of the tree
> different from all others ? Apache has fancy indexing turned on.

What do you consider to be the "normal HTML tree"? Are you referring to
a certain subset of your whole web site, which is all you want to index?
If so, you probably need to make sure your limit_urls_to attribute is
set to limit indexing to that sub-tree. If you mean the parent directory
entries of some pages actually lead htdig right off the server's DocumentRoot
directory and into directories that are not supposed to be visible from
a web browser, that really shouldn't be happening at all, unless you have
seriously misconfigured your server. The parent directory links may lead
back up to the DocumentRoot, but that should be it, so if you're indexing
the whole HTML document tree from the DocumentRoot down, these links should
not lead anywhere htdig hasn't already visited.
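
To make that concrete, a minimal htdig.conf fragment along these lines
is what I have in mind (the host name and paths are made up; adapt them
to your site):

    # illustrative fragment -- host and paths are made up
    start_url:      http://www.example.com/docs/
    limit_urls_to:  http://www.example.com/docs/
    # the usual precautions: CGI areas and Apache fancy-indexing sort links
    exclude_urls:   /cgi-bin/ .cgi ?D= ?M= ?N= ?S=

Since you mention fancy indexing: the ?D=/?M=/?N=/?S= patterns keep
htdig out of the column-sorting links Apache adds to every directory
listing, which are a classic way to fetch the same directory index over
and over under different URLs.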

> >By 2-tiered, do you mean 2-pass? It seems it would be wasteful to parse
> >a document once to look for hypertext links, and then go back to it later
> >to index its contents. I somehow doubt that's what AltaVista does. In
>
> Altavista does something like that, except that I forgot to mention that
> the indexer also reaps URLs from the documents it indexes and adds them to
> the list of URLs to index. There used to be a good description on
> Altavista about what it does, and how. I don't know about now.

Doesn't matter too much. What you describe is essentially what htdig does
right now. The parser reaps words and URLs from the documents, and passes
them back to the retriever, which passes the words on to the word DB, and
passes the URLs on to the URL queue after some preliminary tests. I'm not
sure what the point of this is, though. We barely have a vague idea of
what the problem might be in your case, so a redesign of the retriever seems
a tad premature.
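
If it helps to picture it, here's a toy sketch of that cycle in plain
C++. This is NOT htdig's actual code; parse_links() and index_words()
are made-up stand-ins for the parser and the word-DB interface, but the
visited-set logic is the point:

    // Toy model of the retrieve/parse cycle: a URL queue plus a set of
    // every URL seen this run, so nothing is retrieved twice.
    #include <iostream>
    #include <queue>
    #include <set>
    #include <string>
    #include <vector>

    // Stand-in: extract the hypertext links in the document at url.
    static std::vector<std::string> parse_links(const std::string &url)
    {
        (void)url;
        return {};  // a real parser returns the URLs reaped from the page
    }

    // Stand-in: feed the document's words to the word database.
    static void index_words(const std::string &url)
    {
        std::cout << "indexing " << url << "\n";
    }

    static void crawl(const std::string &start_url)
    {
        std::queue<std::string> pending;  // URLs waiting to be retrieved
        std::set<std::string> visited;    // every unique URL seen so far

        pending.push(start_url);
        visited.insert(start_url);

        while (!pending.empty()) {
            const std::string url = pending.front();
            pending.pop();

            index_words(url);
            for (const std::string &link : parse_links(url)) {
                // insert() tells us whether the URL is new; a URL that
                // was already queued is never queued or retrieved again.
                if (visited.insert(link).second)
                    pending.push(link);
            }
        }
    }

    int main()
    {
        crawl("http://www.example.com/");
        return 0;
    }

So if you really are seeing the same URL retrieved twice in one run,
something upstream of that check must be making the URLs look different
(trailing slashes, case, extra path info, and so on).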

> I'm getting tired of this! I'll take a break and continue on the weekend.
> Meanwhile I'll try to work out a system to break the htdig loops using
> something more elegant than changing permissions on web served directories
> on-the-fly.

Well, if all else fails, perhaps posting some concrete data will help.
Taking shots in the dark can be surprisingly effective as long as you
have good guesses, but we seem to be out of those now, so I think a more
analytical approach is called for.

A word of caution, though: your mailer seems to be dropping characters,
as I can see spaces missing from the text you quoted from my previous
message. You'll want to make sure any log extracts or other data you
post to the list don't get similarly mangled.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
