Re: htdig: htmerge using 4GB

Andrew Scherpbier
Wed, 27 May 1998 16:09:39 +0000

Geoff Hutchison wrote:
> On Tue, 26 May 1998, J.E.J. op den Brouw wrote:
> > Geoff Hutchison wrote:
> > There's nothing to change. GNU sort is fairly good. It will be a hell
> > of a job to rewrite sort...
> > Maybe there is another sort which creates smaller files, but I doubt
> > it...
> See that's just it, I'm not looking for a "better" sort, just one which
> doesn't use so many files. I would imagine one using bubblesort or
> something like that would work well. It would take longer to merge but
> would require 2-3 times the space of the DB for merging, not the 30-40 I'm
> seeing now.

Wait a second here... Unix sort is a merge sort and uses at most 2 times the
space of the file to sort. Look at the actual file sizes /bin/sort creates
and you'll see that it actually does a really good job.
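
To illustrate the space argument, here is a minimal sketch of an external
merge sort. It is not the code of /bin/sort; the function name
external_sort and the chunk_size parameter are made up for this example.
Sorted runs are written to temporary files (about one extra copy of the
input in total) and then merged in a single streaming pass.

```python
import heapq
import os
import tempfile


def external_sort(lines, chunk_size=1000):
    """Sort newline-terminated strings via temp-file runs (hypothetical helper).

    Returns (sorted_lines, temp_bytes), where temp_bytes is the total size
    of all run files written to disk.
    """
    run_paths = []
    temp_bytes = 0
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it out as one sorted run.
        nonlocal temp_bytes
        if not chunk:
            return
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(chunk)
        temp_bytes += os.path.getsize(path)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    flush()

    # Merge all runs lazily; only one line per run is held in memory.
    files = [open(p) for p in run_paths]
    try:
        # Collected into a list for simplicity; a real tool would stream
        # the merged lines straight to the output file.
        result = list(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
    return result, temp_bytes
```

In this sketch the run files together hold exactly one copy of the input,
so temporary space plus output stays at roughly twice the input size.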

> Also htdig could use insertion sort when starting an update dig. It
> already has the old data, it's just going to be inserting and removing
> some documents. So an insertion sort would keep all this data around (so
> as not to duplicate effort).

Maybe. It is not the documents that are being sorted, however. It is the
individual words that are sorted so that the words are grouped together for
insertion into the actual word database.
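
As a rough sketch of why the sort matters: once the word records are
sorted, every entry for a given word is adjacent, so each word can be
written to the word database in one pass. The (word, doc_id) record layout
and the name group_postings are assumptions for illustration, not htdig's
actual format.

```python
from itertools import groupby
from operator import itemgetter


def group_postings(records):
    """Group (word, doc_id) pairs by word (hypothetical record layout).

    Sorting puts all records for the same word next to each other, so
    groupby can collect them in a single sequential pass.
    """
    ordered = sorted(records)
    return [
        (word, [doc_id for _, doc_id in group])
        for word, group in groupby(ordered, key=itemgetter(0))
    ]
```

Each resulting (word, doc_ids) pair corresponds to a single insertion into
the word database, instead of one lookup and update per occurrence.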

> Another possible solution is to sort the new documents using GNU sort (or
> whatever sort command exists) which should be a small number. Then htmerge
> merges this set of documents into the already existing DB and removes
> any documents it needs. My back-of-the-envelope says that most of my time
> running htmerge is in the sorting. Why should I be sorting already-sorted
> documents?

Hmmm... It already kinda does this but at a lower level. The results of an
index of a document are sorted before they are appended to the word list.
This should improve performance since a merge sort actually performs best on
already sorted data.
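
A small sketch of the idea, assuming a "natural" merge sort that detects
ascending runs already present in its input (whether /bin/sort does this
is not something I am claiming here). The names split_runs and
natural_merge_sort are made up for the example. A word list built by
appending per-document sorted chunks contains at most one run per
document, so there is far less to merge than with randomly ordered input.

```python
import heapq


def split_runs(items):
    """Split a list into maximal non-decreasing runs."""
    runs = []
    for item in items:
        if runs and runs[-1][-1] <= item:
            runs[-1].append(item)   # extends the current run
        else:
            runs.append([item])     # order broke: start a new run
    return runs


def natural_merge_sort(items):
    """Sort by merging the runs already present; returns (sorted, run_count)."""
    runs = split_runs(items)
    return list(heapq.merge(*runs)), len(runs)
```

With three documents whose words were sorted before appending, the input
splits into three runs; fully sorted input is a single run and needs no
merging at all.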

Anyway, the real solution is to *not* use a hash-based database like GDBM. I
know exactly what to do to completely get rid of the htmerge phase, it is just
a question of finding time to do it. :-(
(Believe it or not, I actually created an htdig4 directory on my home machine
over the weekend... Then I got sidetracked into researching the pipelining of
HTTP/1.1, which could greatly improve htdig's performance...)

Andrew Scherpbier
Contigo Software
