Subject: Re: [htdig] Suse 6.2 + htdig 3.1.5: looping again
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon May 01 2000 - 15:11:50 PDT


According to Peter L. Peres:
> I have run htdig with various options and have settled on a
> compression of 3, -l, and a few other things. Performance and database size
> are much better than before, but it still tries to re-index the whole site
> after a while, i.e. it loops.
> The site is served by Apache, and contains html, pdf, txt, and other
> files, including directory indexes. Adding D=M? ... etc. to the bad URLs
> list seemed to help, yet the htdig program loops again.
> Is there no way to keep a list of seen URLs in htdig and avoid them?
> That would be a first step towards avoiding looping. Surely there is a way
> to do this.

It's been said time and time again on this list, but I'll repeat it.
htdig DOES keep track of visited URLs, and does NOT re-index any page
with a unique URL more than once per indexing run. Have a look at the
code if you find that hard to believe. I would suggest that you take a
close look at the verbose output of htdig to see what the cause of the
"looping" is. Most of the time, there's a subtle difference between
the URLs, causing htdig to see different URLs which point to the same
document, and therefore reindexing the document. There are many causes
of this behaviour:

- Improper links to SSI documents, causing a buildup of extra path
information on the URL.
- A similar buildup of ignored extra path information, or extra URL
parameters to a CGI script.
- A CGI script that generates an infinite virtual tree of URLs through
links to itself.
- Many symbolic links to documents, and hypertext links to documents
through some of these symbolic links, causing many different virtual
trees of the same set of documents.
- Mixed case references to documents on a case-insensitive web server,
causing many different virtual trees if case_sensitive is not set to
false.
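To see why such near-identical URLs defeat the duplicate check, here's a
toy sketch (made-up code and URLs, not the actual htdig source): the
visited-URLs list only compares URL strings, so spellings that differ
just in case count as separate documents unless the case is folded first.

#include <iostream>
#include <set>
#include <string>

int main()
{
    std::set<std::string> visited;   // URLs indexed so far in this run

    // On a case-insensitive server these all return the same page,
    // but as strings they are three distinct entries:
    visited.insert("http://www.example.com/docs/index.html");
    visited.insert("http://www.example.com/Docs/index.html");
    visited.insert("http://www.example.com/docs/Index.HTML");

    std::cout << visited.size() << " \"different\" documents\n";  // prints 3

    // Setting case_sensitive to false lets the case be folded before
    // the lookup, so all three spellings collapse into one entry.
}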

> In theory, it should build a map of the site as a tree and
> then walk it, refusing to 'climb' any branches. Yes I did add .. to the
> list of bad urls to avoid its going up to directory index parents. This
> seems to work.

I doubt a .. in exclude_urls would do anything, as relative URLs are
expanded to fully-qualified URLs before being checked against this list.
It's the visited URLs list in htdig that prevents it from climbing back
up the tree to nodes it's already indexed.
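To illustrate (again, just a hypothetical sketch, not htdig's own URL
code): once a relative link containing ".." has been resolved against
its base URL, the ".." is gone, so a ".." pattern in exclude_urls has
nothing left to match.

#include <cassert>
#include <iostream>
#include <string>

// Very simplified resolver that only handles leading "../" components;
// real relative-URL resolution (RFC 1808) covers many more cases.
static std::string resolve(std::string base_dir, std::string rel)
{
    while (rel.rfind("../", 0) == 0) {
        rel.erase(0, 3);                            // drop the "../"
        base_dir.erase(base_dir.find_last_of('/', base_dir.size() - 2) + 1);
    }
    return base_dir + rel;
}

int main()
{
    std::string full = resolve("http://www.example.com/docs/sub/",
                               "../index.html");
    std::cout << full << "\n";    // http://www.example.com/docs/index.html

    // exclude_urls is matched against this fully-qualified form, so a
    // bare ".." pattern never fires:
    assert(full.find("..") == std::string::npos);
}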

> I make some simple assumptions here, but if only a hash table of visited
> URLs is stored in memory, to avoid having to open the db file of urls
> every time, then this would not slow things down too much?
> To the best of my knowledge, Altavista used a 2-tiered engine, with a
> gatherer (of URLs) and an indexer that follows the URL list built by the
> gatherer.
> Is this or will it ever be a part of htdig ?

By 2-tiered, do you mean 2-pass? It seems it would be wasteful to parse
a document once to look for hypertext links, and then go back to it later
to index its contents. I somehow doubt that's what AltaVista does. In
htdig, the engine gathers up words to index and links to follow all in
one pass. The links to follow just get added to the queue of URLs to
visit, but before htdig does that with any link, it expands it to a
fully qualified path and checks it against the list of visited URLs.
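As a rough illustration of that single pass (my own toy code, not the
htdig source; the pages and their contents are invented), note that even
though the toy pages link to each other in a cycle, each URL is fetched
and indexed exactly once:

#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Toy stand-ins; in htdig the retriever and parser do this for real.
struct Page {
    std::vector<std::string> words;
    std::vector<std::string> links;
};

// Hypothetical fetch: three pages that link to each other in a cycle.
static Page fetch_and_parse(const std::string &url)
{
    static const std::map<std::string, Page> site = {
        { "http://site/",       { {"home"},  {"http://site/a.html",
                                              "http://site/b.html"} } },
        { "http://site/a.html", { {"alpha"}, {"http://site/"} } },
        { "http://site/b.html", { {"beta"},  {"http://site/a.html"} } },
    };
    auto it = site.find(url);
    return it != site.end() ? it->second : Page{};
}

int main()
{
    std::set<std::string>   visited;   // fully-qualified URLs seen this run
    std::queue<std::string> pending;
    pending.push("http://site/");

    while (!pending.empty()) {
        std::string url = pending.front();
        pending.pop();
        if (!visited.insert(url).second)
            continue;                  // already indexed once: skip it

        Page page = fetch_and_parse(url);  // one pass: words AND links
        for (const auto &w : page.words)
            std::cout << "word '" << w << "' from " << url << "\n";
        for (const auto &l : page.links)
            pending.push(l);           // would be expanded to absolute first
    }
}

Running this prints one "word" line per page and then stops, because
every link that leads back to an already-visited URL fails the
visited-list test.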

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
