Subject: Re: [htdig] Suse long run: done, problem solved
From: David Robley (huntsman@www.nisu.flinders.edu.au)
Date: Wed Apr 26 2000 - 18:32:21 PDT


On 27 Apr, Peter L. Peres wrote:
>
> Hi,
>
> the machine finished ! The loop was in the java api docs. There were no
> other loops. There is no bug in htdig wrt. this problem (looping).
>
> Here are some stats from the end:
>
> 27425.60user 10781.29system 43:11:27elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (18429778major+3453038minor)pagefaults 2501532swaps
>
> htdig ran with a niceness 18 for the last 25% of the indexing. Load was
> 0.8 or so during this time. docdb is about 200MB. My input was about
> 220MB.
>
> The loop problem was in the tree:
>
> /usr/doc/packages/javadoc/docs/api/
>
> which has more than 500 entries.
>
> System: i486/100MHz/24MB RAM 4.3+2.8 GB EIDE disks (not UDMA), headless
> (ethernet only) Suse 6.2 Linux (w. modified html documentation system - by
> me). As you can see the machine was swapping like crazy. I think I'd need
> a machine with 256MB RAM to avoid serious swapping. Not likely anytime
> soon.
>
> thank you all for the ideas,
>
> Peter

I think I've come across this sort of problem when trying to index a
series of documents that have a lot of internal references
(<A HREF="#target">). htdig tries to follow each of these links and ends
up going in ever-decreasing circles until....

My solution was to add something like html# to the exclude_urls list.
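
For illustration only, here is a minimal Python sketch of the substring
matching that exclude_urls is documented to perform: a URL is rejected if
it contains any of the space-separated patterns. The pattern list (the
usual CGI exclusions plus html#), the URLs, and the helper name
is_excluded are all assumptions made up for this example; none of this is
htdig's own code.

```python
def is_excluded(url, exclude_urls):
    """Return True if url contains any of the space-separated patterns."""
    return any(pattern in url for pattern in exclude_urls.split())


# Hypothetical attribute value: common CGI exclusions plus "html#".
EXCLUDE_URLS = "/cgi-bin/ .cgi html#"

# Made-up URLs: a plain page, the same page with an internal
# reference, and a CGI script.
urls = [
    "http://localhost/docs/api/index.html",
    "http://localhost/docs/api/index.html#target",
    "http://localhost/cgi-bin/search",
]

# Keep only the URLs that no pattern matches.
to_index = [u for u in urls if not is_excluded(u, EXCLUDE_URLS)]
print(to_index)
```

In a configuration file the equivalent would presumably be a line along
the lines of "exclude_urls: /cgi-bin/ .cgi html#". Note that the html#
pattern only catches fragment links on pages whose names end in html.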

Cheers

-- 
David Robley                        | WEBMASTER & Mail List Admin
RESEARCH CENTRE FOR INJURY STUDIES  | http://www.nisu.flinders.edu.au/
AusEinet                            | http://auseinet.flinders.edu.au/
            Flinders University, ADELAIDE, SOUTH AUSTRALIA


This archive was generated by hypermail 2b28 : Wed Apr 26 2000 - 16:19:19 PDT