[htdig3-dev] Indexing a largish site


loic@ceic.com
Tue, 2 Nov 1999 12:15:42 +0100 (MET)


Toivo Pedaste writes:
>
> I was able to index about 100000 pages in less than a day on a
> machine with 512meg of memory; on a 256meg machine it had only
> done 50000 pages after two days. The indexing process does
> seem very memory intensive if you want decent performance. I'm
> not sure what can be done about it though; it seems to be
> just lack of locality of reference into the db.words.db file.

 No locality of reference, indeed.

> I believe there are plans to checksum pages so as to reject
> aliases (duplicates). How is that going? It is really something
> of an administrative nightmare to deal with a large site without it.
>
> I'm also getting close to the 2Gig file size limit on my
> words.db file. Is there any structural reason that it
> couldn't be split into multiple files?

 Four solutions:
 1. activate compression in WordList.cc,
 2. db_dump + db_load, which would reduce the size of the file by half,
 3. implement a dynamic repacker in Berkeley DB,
 4. implement autosplit files in WordList.cc based on a key
    calculated from the word.
 Of all these we are working on 1 and 2.
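
 To illustrate the autosplit idea (splitting the word file based on a key
calculated from the word), here is a minimal sketch in Python. It is not
the WordList.cc code; the function names, the CRC32 key and the
db.words.N.db naming scheme are all assumptions made for the example.

```python
import zlib


def shard_for_word(word: str, num_shards: int) -> int:
    """Return the index of the file that should hold this word.

    Assumption: the key is a CRC32 of the word, taken modulo the number
    of files. zlib.crc32 is stable across runs, unlike Python's hash().
    """
    return zlib.crc32(word.encode("utf-8")) % num_shards


def shard_filename(word: str, num_shards: int, base: str = "db.words") -> str:
    """Hypothetical naming scheme: db.words.0.db, db.words.1.db, ..."""
    return f"{base}.{shard_for_word(word, num_shards)}.db"


if __name__ == "__main__":
    for w in ("index", "berkeley", "locality"):
        print(w, "->", shard_filename(w, num_shards=4))
```

 With a key of this kind every occurrence of a given word lands in the
same file, so an exact-word lookup opens only one file. The words are no
longer kept in sorted order across files, so a prefix search would have
to consult all of them.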

 What is the size of your original data?
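
 On the checksum question quoted above, a minimal sketch of rejecting
aliases by content digest, again in Python. This is not the planned
ht://Dig implementation; the SHA-256 digest, the function names and the
(url, content) input format are assumptions made for the example.

```python
import hashlib


def page_digest(content: bytes) -> str:
    """Checksum of the raw page body (assumption: SHA-256 over the bytes)."""
    return hashlib.sha256(content).hexdigest()


def split_aliases(pages):
    """Separate pages into first-seen URLs and aliases of earlier URLs.

    pages: iterable of (url, content) pairs, content as bytes.
    Returns (unique_urls, aliases), where each alias is a pair
    (duplicate_url, url_first_seen_with_the_same_content).
    """
    first_seen = {}  # digest -> first URL fetched with that content
    unique_urls = []
    aliases = []
    for url, content in pages:
        digest = page_digest(content)
        if digest in first_seen:
            aliases.append((url, first_seen[digest]))
        else:
            first_seen[digest] = url
            unique_urls.append(url)
    return unique_urls, aliases
```

 Only byte-identical pages are caught this way; two copies that differ
by a timestamp or a session id would still be indexed twice.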

-- 
		Loic Dachary

ECILA 100 av. du Gal Leclerc 93500 Pantin - France Tel: 33 1 56 96 10 85 e-mail: loic@dachary.org URL: http://www.senga.org/




This archive was generated by hypermail 2.0b3 on Tue Nov 02 1999 - 02:09:38 PST