AW: [htdig] Going for the big dig


Subject: AW: [htdig] Going for the big dig
From: Reich, Stefan (Stefan.Reich@dgn-service.de)
Date: Wed Dec 20 2000 - 00:35:24 PST


Not quite sure if this helps, but maybe ;-)

If I'm right, Lotus.com is running on some kind of Notes/Domino server.

We've experienced lots of problems with these Notes servers because of their
meshed link structure.

As Notes links are generated on the fly by the server, you get something
like:

http://www.notes.net/notesua.nsf/a08df36b2299a8bc8525665d006dce40/7a66d789bd990c06852569200051a9a5?OpenDocument

A reference on this page could be another link like this. As the numbers are
not static, you can reach the same page through millions of different URLs.
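
A minimal sketch of how a crawler could treat such variants as one page. It assumes the path has the shape /db.nsf/<view-id>/<doc-id> and that only the last segment identifies the document; that is an assumption about the URL layout, not something verified here, and the function name domino_doc_key is purely illustrative (it is not part of htdig).

```python
from urllib.parse import urlsplit


def domino_doc_key(url):
    """Return a key that is identical for URL variants of one document.

    Assumption: the path looks like /db.nsf/<view-id>/<doc-id> and only the
    last path segment identifies the document.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Locate the database segment (the one ending in .nsf).
    nsf_index = next(
        (i for i, s in enumerate(segments) if s.lower().endswith(".nsf")),
        None,
    )
    if nsf_index is None or len(segments) - nsf_index < 3:
        # Not the assumed shape: fall back to the URL without its fragment.
        return parts._replace(fragment="").geturl()
    db_path = "/".join(segments[: nsf_index + 1])
    doc_id = segments[-1].lower()
    # The changing view segment is deliberately left out of the key.
    return f"{parts.netloc.lower()}/{db_path}#{doc_id}"
```

A crawler would keep a set of these keys and skip any link whose key is already in the set, instead of comparing raw URLs.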

This is especially true of tree views. If you expand a tree section and
collapse it again, you should be on the same document, but you'll get
another URL for it. So you can expand again and collapse again and
expand again and collapse again and ... This may be the problem, as htdig
never gets to the end of this mesh-space.

I cannot validate this for lotus.com, but it is definitely true for some of
the Notes servers we tried to index.

On Lotus.com,

http://doc.notes.net/msd/1.1/msddoc.nsf/8525601a0077f5dc85255d7c00545af7?OpenView&Start=1&Count=30&Collapse=1.2#1.2

seems to bring you into such a space (though it does not look as bad as some
I've seen on other servers; I cannot find a real loop there, but there may be
one in other parts of the server).
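
One way to keep a crawler out of such a space is to skip links that only re-render a view in a different state. The sketch below is hedged: the parameter names Start, Count and Collapse are taken from the URL above, Expand is my assumption for the opposite action, and the function names are hypothetical rather than anything htdig provides.

```python
from urllib.parse import urlsplit

# Start, Count and Collapse appear in the example URL above.
# Expand is an assumed name for the opposite action.
NAV_PARAMS = {"start", "count", "collapse", "expand"}


def is_view_navigation(url):
    """Return True if the query string carries view-navigation state."""
    query = urlsplit(url).query
    for token in query.split("&"):
        # Tokens look like "Start=1" or a bare command such as "OpenView".
        name = token.split("=", 1)[0].lower()
        if name in NAV_PARAMS:
            return True
    return False


def filter_links(links):
    """Drop links that only show a view in another expand/collapse state."""
    return [u for u in links if not is_view_navigation(u)]
```

The trade-off is that documents reachable only through later pages of a view (the Start parameter) would be missed, so the filter may need loosening per site. In htdig itself, the exclude_urls attribute in the configuration file is the usual place for this kind of pattern.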

So indexing Notes Servers (or at least badly designed Web Interfaces of
Notes Servers) can be a nightmare.

Cheers

  Stefan

-----Original Message-----
From: David Gewirtz [mailto:david@ZATZ.com]
Sent: Tuesday, 19 December 2000 18:48
To: htdig@htdig.org
Subject: Re: [htdig] Going for the big dig

Thanks for the answers to my question below, but I'm still not clear
on something. I attempted to index a remote site, in this case Lotus.com.
Now, I have no idea how many pages that is. But I let the index process run
for three days and by the end of three days, Linux was page-swapping like a
banshee and was becoming substantially unresponsive. Given that that was
only one site, and I'm thinking about indexing a lot more, I've been trying
to figure out what I need to do to make the hardware/software able to
handle it. Right now, I'm thinking the process is too big. Can htdig and/or
htmerge running on a 256MB or 384MB machine handle indexing/merging sites
like lotus.com or other large sites, or is this beyond the scope of this
tool? And, if we don't know the size of external sites, how can I go about
thinking through this issue?

-- DG

>So, I'm finishing up pre-deployment testing and I seem to have run into
>limits of the system. I'm running htdig on a 256MB PIII, Mandrake 7.2
>system. When I just index our own sites, digging is fast and the system
>seems quite responsive. But, ideally, I'd like to dig 40-60 sites per topic
>(say, Lotus Domino sites) and then maybe 3 or more topics. But it seems
>that although this box has a large amount of RAM (it maxes at 384M) and a
>40GB disk, the digging process is just too memory intensive and everything
>slows down to a crawl.
>
>So, here's the question: can I index large sites (like, say, lotus.com)? Or
>are we just going to run into machine limits and I'm best off using htdig
>for my own sites and leave the dream of indexing outside sites to a later
>project?
>
>If I'm missing something, or there's an ideal configuration for attempting
>this approach, please enlighten me.
>
>Thanks!
>

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-unsubscribe@htdig.org
You will receive a message to confirm this.
List archives: <http://www.htdig.org/mail/menu.html>
FAQ: <http://www.htdig.org/FAQ.html>




This archive was generated by hypermail 2b28 : Wed Dec 20 2000 - 00:47:44 PST