Subject: Re: [htdig] Suse 6.2 + htdig 3.1.5: looping again
From: Peter L. Peres (plp@actcom.co.il)
Date: Mon May 01 2000 - 22:42:16 PDT


Hi,

On Mon, 1 May 2000, Gilles Detillieux wrote:

>It's been said time and time again on this list, but I'll repeat it.
>htdig DOES keep track of visited URLs, and does NOT re-index any page
>with a unique URL more than once per indexing run. Have a look at the

OK, so I see double, and 'less' does too, since I've been using it to look
at the log (search). Look, I'm new to htdig, so please bear with me here.
While it is always possible that I made a mistake somewhere, it is not so
likely anymore after 6 runs or so. Maybe I am exceeding the tiny machine's
limits where I run this and some obscure libc or db bug is causing this.

>- Improper links to SSI documents, causing a buildup of extra path
>information on the URL.
>- A similar buildup of ignored extra path information, or extra URL
>parameters to a CGI script.
>- A CGI script that generates an infinite virtual tree of URLs through
>links to itself.
>- Many symbolic links to documents, and hypertext links to documents
>through some of these symbolic links, causing many different virtual
>trees of the same set of documents.
>- Mixed case references to documents on a case-insensitive webserver,
>causing many different virtual trees if case_sensitive is not set to
>false.

The web server is case sensitive, the SSI includes do not cause new URLs
to be generated, and all CGI scripts are concentrated in one place, which
was pruned. I forgot about canonicalization and '..', but that brings up a
SERIOUS problem imho:

Since I index some directory trees as is, they have the parent directory
entry. Now, some of the directories are NOT under the HTML document tree.
In fact, all the looping problems occur outside the normal HTML tree, in
directory index land. I have verified that the original Suse HTML docs can
be indexed cleanly in limited time (I use this as a test case).
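
To make the canonicalization point concrete, here is a minimal sketch in
Python of collapsing '..' (parent directory) links before a URL is compared
against the list of visited URLs. It is not htdig's code; the function name
canonicalize and the host name in the usage lines are made up for
illustration.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Collapse '.', '..' and repeated slashes in the path of a URL.

    Hypothetical helper for illustration only, not part of htdig.
    """
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath drops the trailing slash; restore it when the original path
    # named a directory, so that "/docs/" and "/docs/x/.." compare equal.
    if parts.path.endswith(("/", "/.", "/..")) and not path.endswith("/"):
        path += "/"
    # Host names are case-insensitive, so fold them; the fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


visited = set()
for url in ("http://host/docs/",
            "http://host/docs/howto/../",
            "http://HOST/docs/howto/../howto/.."):
    visited.add(canonicalize(url))
# All three spellings collapse to the single entry "http://host/docs/".
```

If the indexer compared raw URL strings instead, each extra 'howto/../'
would look like a new document, which is one way a directory listing with a
parent directory entry could produce an endless virtual tree.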

So there is a bug in there, but where? What makes this part of the tree
different from all others? Apache has fancy indexing turned on.

>By 2-tiered, do you mean 2-pass? It seems it would be wasteful to parse
>a document once to look for hypertext links, and then go back to it later
>to index its contents. I somehow doubt that's what AltaVista does. In

AltaVista does something like that, except that I forgot to mention that
the indexer also reaps URLs from the documents it indexes and adds them to
the list of URLs to index. There used to be a good description on
AltaVista of what it does, and how. I don't know about now.
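
The scheme described above (index a document, reap its links during that
same pass, append unseen ones to the list of URLs to index) can be sketched
roughly as below. This is an assumption-laden illustration in Python, not
AltaVista's or htdig's actual code: crawl, fetch_links and the tiny SITE
table are all hypothetical names.

```python
from collections import deque


def crawl(start, fetch_links, max_pages=1000):
    """Single-pass crawl: index each page once and queue the links found on it."""
    queue = deque([start])
    seen = {start}          # every URL ever queued, so nothing is queued twice
    indexed = []
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        indexed.append(url)            # stand-in for parsing and indexing the page
        for link in fetch_links(url):  # links reaped during that same parse
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


# Hypothetical three-page site with a cycle between /a and /b.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": ["/a"]}
order = crawl("/", lambda url: SITE.get(url, []))
```

The seen set is what stops the cycle between /a and /b. It only helps when
two spellings of the same document compare equal, so URLs would need to be
canonicalized before they are looked up in it.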

I'm getting tired of this! I'll take a break and continue on the weekend.
Meanwhile I'll try to work out a system to break the htdig loops using
something more elegant than changing permissions on web served directories
on-the-fly.
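
One candidate for that, sketched here in Python, is a plain substring
filter applied to every URL before it is queued, similar in spirit to what
I understand htdig's exclude_urls attribute to do. The pattern list is
purely an assumption: the '?N=' style entries guess at the column-sort
links that Apache fancy indexing adds to directory listings, and would have
to be checked against what the log actually shows.

```python
# Hypothetical patterns; adjust after looking at the URLs in the htdig log.
EXCLUDE = ["/cgi-bin/", "?N=", "?M=", "?S=", "?D="]


def is_excluded(url, patterns=EXCLUDE):
    """Return True if the URL contains any of the excluded substrings."""
    return any(pattern in url for pattern in patterns)


candidates = [
    "http://host/doc/pkg/",
    "http://host/doc/pkg/?N=D",    # same listing, sorted by name, descending
    "http://host/cgi-bin/search",
]
to_index = [url for url in candidates if not is_excluded(url)]
```

Only the plain directory URL survives the filter, so the sorted views of
the same listing never reach the queue.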

bye,

        Peter



