Re: htdig: htmerge using 4GB

Geoff Hutchison (
Wed, 27 May 1998 12:45:47 -0400 (EDT)

On Wed, 27 May 1998, Andrew Scherpbier wrote:

> Wait a second here... Unix sort is a merge sort and uses at most 2 times the
> space of the file to sort. Look at the actual file sizes /bin/sort creates
> and you'll see that it actually does a really good job.

I know UNIX sort is a merge sort, and it does seem to do a really
good job. That's why I'm so confused, since it seems to be eating a
drive 30 times larger than the databases. But perhaps my rundig script
was the culprit after all and the "no space on device" error was on
the database drive. Since I can't stop the merge to get information
until it cleans up all of the temporary files, I don't know. But the
error is coming from /usr/local/bin/sort, so I think that's the first
suspect.
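
One way to tell which drive is actually filling up is to compare free
space on the file system that holds sort's temporary files with the one
that holds the databases. The following is only a minimal Python
sketch, not part of htdig or rundig: the helper names are made up for
illustration, and `db_dir` is a hypothetical stand-in for the real
database directory. It assumes sort writes its temporary files to the
system temporary directory (sort honors TMPDIR and the -T option).

```python
import shutil
import tempfile


def free_bytes(path):
    """Return the free space, in bytes, on the file system holding path."""
    return shutil.disk_usage(path).free


def report(paths):
    """Map each path to the free space on its file system."""
    return {p: free_bytes(p) for p in paths}


# Assumption: sort's temporary files go to the system temp directory.
sort_tmp = tempfile.gettempdir()
# Hypothetical stand-in for the directory holding the htdig databases.
db_dir = "."

usage = report([sort_tmp, db_dir])
for path, free in usage.items():
    print(f"{path}: {free // (1024 * 1024)} MB free")
```

Running this periodically while the merge is in progress would show
which of the two file systems is shrinking.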

At this point the discussion is moot since I'm not seeing the problem.

> Maybe. It is not the documents that are being sorted, however. It is the
> individual words that are sorted so that the words are grouped together for
> insertion into the actual word database.

> Hmmm... It already kinda does this but at a lower level. The results of an
> index of a document are sorted before they are appended to the word list.
> This should improve performance since a merge sort actually performs best on
> already sorted data.

This is what I was implying. I guess I didn't read enough code to see
that sort.
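
The idea discussed above (sort each document's words first, then merge
the sorted runs so that identical words end up grouped together) can be
sketched as follows. This is an illustration in Python under stated
assumptions, not htdig's actual code: the function names and the
(word, doc_id) record layout are hypothetical.

```python
import heapq


def sort_document_words(doc_id, words):
    """Return one document's (word, doc_id) records, sorted by word."""
    return sorted((word, doc_id) for word in words)


def merge_runs(runs):
    """Lazily merge already-sorted runs into a single sorted stream."""
    return heapq.merge(*runs)


def group_by_word(stream):
    """Collect document ids per word; equal words arrive adjacent."""
    grouped = {}
    for word, doc_id in stream:
        grouped.setdefault(word, []).append(doc_id)
    return grouped


# Each run is sorted on its own before being handed to the merge.
runs = [
    sort_document_words(1, ["sort", "merge", "disk"]),
    sort_document_words(2, ["merge", "word", "disk"]),
]
index = group_by_word(merge_runs(runs))
print(index)
```

Because every run is already sorted, the merge step only has to compare
the heads of the runs and never needs to hold a whole run in memory.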

> (Believe it or not, I actually created an htdig4 directory on my home machine
> over the weekend... Then I got sidetracked into researching the pipelining of
> HTTP/1.1, which could greatly improve htdig's performance...)

Great! I don't know where you're doing the research, but the w3 robot
has some sample C code as part of their libwww package. Will you be
taking submissions for htdig4 from the patch library? I could work out
the new META standards for robots and spiders (The description patch
is a first start).
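
For reference, HTTP/1.1 pipelining means writing several requests on
one connection before reading any of the responses. A minimal Python
sketch of building such a batch is below. It is not taken from htdig or
libwww; the function name, host, and paths are hypothetical, and it
only constructs the request bytes without opening a connection.

```python
def build_pipelined_requests(host, paths):
    """Concatenate several HTTP/1.1 GET requests for one connection."""
    requests = []
    for i, path in enumerate(paths):
        last = i == len(paths) - 1
        lines = [
            f"GET {path} HTTP/1.1",
            f"Host: {host}",
            # Ask the server to close the connection after the last reply.
            "Connection: close" if last else "Connection: keep-alive",
            "",
            "",
        ]
        requests.append("\r\n".join(lines))
    return "".join(requests).encode("ascii")


# Hypothetical host and paths, for illustration only.
payload = build_pipelined_requests("example.com", ["/", "/about.html"])
print(payload.decode("ascii"))
```

A client would send `payload` in a single write and then read the
responses back in the same order the requests were issued.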

-Geoff Hutchison
Williams Students Online

This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:18 PST