Re: [htdig] 50Gb of text

Geoff Hutchison
Fri, 18 Jun 1999 08:36:54 -0400

" " wrote:
> What will be with htdig if I'll try to index and search 50Gb
> of text? It's serious - I have to do it but can't make an

First off, you'll need a monster of a machine. I'd guess you'll need at
least 100GB for storage and temporary space. You'll probably also want
something around 1GB of RAM. These are first guesses, and I'm probably on
the low end, since I've never indexed anywhere near that amount of text.
(Note that these requirements are not limited to ht://Dig--the nature of
indexing that amount of text is going to require those resources.)

> assumption on how much time the search will occur, what
> algorithm to choose not to get 1,000,000 results...

The first step towards that will be to trim out very common words. IMHO,
you really don't get anything useful from a search that returns
1,000,000 documents anyway. If you agree, you can take a look at common
words in db.wordlist (for example cut -f 1 -d ' ' db.wordlist | sort |
uniq -c | sort -rn) and add most of these to the bad_words list.
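
If the shell pipeline is awkward to adapt, the same counting idea can be
sketched in Python. This is only an illustration, not part of ht://Dig: the
function name top_words and the limit parameter are made up here, and it
assumes, like the cut command above, that each line of db.wordlist starts
with the word followed by a space.

```python
from collections import Counter


def top_words(lines, limit=50):
    """Return the `limit` most frequent words as (word, count) pairs.

    Assumption: the word is the first space-separated field of each line,
    mirroring `cut -f 1 -d ' '` in the shell pipeline.
    """
    counts = Counter()
    for line in lines:
        # Keep only the first field; strip a trailing newline if present.
        word = line.split(" ", 1)[0].strip()
        if word:  # skip blank lines
            counts[word] += 1
    return counts.most_common(limit)


def print_candidates(path, limit=50):
    """Print count and word for the most common words in the file at `path`."""
    # latin-1 never fails to decode, which is convenient for old indexes.
    with open(path, encoding="latin-1") as handle:
        for word, count in top_words(handle, limit):
            print(count, word)
```

Calling print_candidates("db.wordlist") would list the most frequent words
first. You would still want to review that list by hand before copying
anything into bad_words, since some frequent words may be meaningful for
your collection.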

-Geoff Hutchison
Williams Students Online

This archive was generated by hypermail 2.0b3 on Fri Jun 18 1999 - 04:57:53 PDT