Re: htdig: htmerge using 4GB


Geoff Hutchison (Geoffrey.R.Hutchison@williams.edu)
Wed, 27 May 1998 10:37:08 -0400 (EDT)


On Tue, 26 May 1998, J.E.J. op den Brouw wrote:

> Geoff Hutchison wrote:
> There's nothing to change. GNU sort is fairly good. It will be a hell
> of a job to rewrite sort...
> Maybe there is another sort which creates smaller files, but I doubt
> it...

See, that's just it: I'm not looking for a "better" sort, just one which
doesn't use so many files. I would imagine one using bubblesort or
something like that would work well. It would take longer to merge but
would require 2-3 times the space of the DB for merging, not the 30-40
times I'm seeing now.

Also, htdig could use insertion sort when starting an update dig. It
already has the old data, it's just going to be inserting and removing
some documents. So an insertion sort would keep all this data around (so
as not to duplicate effort).
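
To make the idea concrete, here is a rough sketch in Python. It is not
htdig code; the (word, doc_id) record layout and the apply_update name are
assumptions made up for illustration. It keeps the already-sorted list
from the previous dig and only touches the entries that changed:

```python
import bisect

def apply_update(sorted_records, added, removed):
    """Update an already-sorted list in place instead of re-sorting it.

    Records are hypothetical (word, doc_id) tuples; the real htdig
    word-list format is not modeled here.
    """
    # Drop entries for removed documents, locating each by binary search.
    for rec in removed:
        i = bisect.bisect_left(sorted_records, rec)
        if i < len(sorted_records) and sorted_records[i] == rec:
            del sorted_records[i]
    # Insert each new entry at its sorted position.
    for rec in added:
        bisect.insort(sorted_records, rec)
    return sorted_records
```

Each insert or delete still shifts list elements, so this only pays off
when the number of changes is small compared to the size of the DB.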

Another possible solution is to sort only the new documents using GNU sort
(or whatever sort command exists), which should be a small number. Then
htmerge merges this set of documents into the already existing DB and
removes any documents it needs to. My back-of-the-envelope estimate says
that most of my time running htmerge is spent sorting. Why should I be
sorting already-sorted documents?
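
Roughly what I have in mind, sketched in Python rather than as a patch to
htmerge. The merge_update name and the (word, doc_id) records are
assumptions for illustration, and the call to sorted() stands in for
running the external sort on the small batch of new documents:

```python
import heapq

def merge_update(old_sorted, new_records, removed_ids):
    """Merge a small batch of new records into an already-sorted stream.

    old_sorted  -- iterable of (word, doc_id) tuples, already in order
    new_records -- unsorted (word, doc_id) tuples from the update dig
    removed_ids -- set of doc_ids whose entries should be dropped
    """
    # Only the new batch is sorted; the old data is never re-sorted.
    new_sorted = sorted(new_records)
    # heapq.merge walks both sorted inputs once, yielding records in order.
    for rec in heapq.merge(old_sorted, new_sorted):
        if rec[1] not in removed_ids:
            yield rec
```

Because the result is produced as a stream, the merge itself needs no
temporary sort files; the cost is one pass over the old DB plus a sort of
the new documents.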

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
