Beyond the usual bug-fixes and feature enhancements, ht://Dig could use some key speed boosts. These would help scalability, since many of the speed problems become quite pronounced for large databases. This is a proposal for some performance improvements.
The indexer itself could use a big speed boost. One key improvement would be to implement HTTP/1.1 with persistent connections. The HTTP/1.1 protocol speeds up transactions in general, but it is an especially big win for an indexer, which can reuse a small number of connections for many requests. Server load is generally lower as well, so the indexer can make more transactions. A good-looking GPL implementation is available: libghttp. Additionally, HTTP/1.1 supports byte-range requests, simplifying updates--the indexer could request a zero-length range to receive just the headers.
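As a rough illustration of the idea (not htdig's actual code), the sketch below builds an HTTP/1.1 request for a header-only update check. The helper name and exact header layout are assumptions; note also that HTTP has no true zero-byte range, so the closest practical forms are a one-byte range (`bytes=0-0`) or a HEAD request.

```python
# Sketch, assuming a raw-socket HTTP client: an HTTP/1.1 request that can be
# sent repeatedly over one persistent connection, asking for the smallest
# possible byte range so an update check retrieves essentially just headers.

def build_update_request(host: str, path: str) -> str:
    """Return an HTTP/1.1 GET request suitable for a header-only update check.
    (Hypothetical helper for illustration; not part of ht://Dig.)"""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: keep-alive",  # HTTP/1.1 keeps the connection open by default
        "Range: bytes=0-0",        # one byte; the headers carry the metadata we need
        "",
        "",
    ]
    return "\r\n".join(lines)

request = build_update_request("www.example.com", "/index.html")
```

A HEAD request would avoid the body entirely; the range form is shown because the original proposal frames the trick in terms of byte ranges.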
An additional speedup could come from using a multi-threaded indexer. This would enable the indexer to connect to responsive servers while waiting for a long transaction. This might require considerable changes, but a "poor man's multithreading" is available with the new database merging code. Multiple copies of htdig can index separate subsets of a space and combine the results later. This provides much of the functionality while eliminating problems with file-locking and thread libraries.
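The "poor man's multithreading" amounts to splitting the URL space into disjoint subsets, one per htdig process, and merging the resulting databases afterward. A minimal sketch of such a partition, using a hash scheme that is purely an assumption of mine (htdig could split by server, directory, or any other rule):

```python
# Sketch: deterministically partition a URL list so each of n indexer
# processes gets a disjoint subset; results are merged later with the
# database-merging code. The MD5-based bucketing is illustrative only.
import hashlib

def partition(urls, n_indexers):
    buckets = [[] for _ in range(n_indexers)]
    for url in urls:
        h = int(hashlib.md5(url.encode()).hexdigest(), 16)
        buckets[h % n_indexers].append(url)  # same URL always lands in same bucket
    return buckets

parts = partition(["http://a/", "http://b/", "http://c/", "http://d/"], 2)
```

Because each process owns its subset outright, no file locking between indexers is needed, which is exactly the advantage the proposal claims over real threads.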
One of the biggest possible speedups for htmerge would be to eliminate it entirely. Careful design of a new database structure could remove this pass completely. The program would remain, if only to merge multiple database sets together. One bonus of redesigning the databases to eliminate the merge pass would be the ability to search the databases as they are updated.
Andrew said he had a nice solution for this problem. My suggestion would be to generate the binary reverse indexes on-the-fly and update them as necessary. This would require more disk transactions in the indexing phase, but should be an overall win by eliminating the merging phase and allowing for parallel indexing.
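In-memory, the on-the-fly approach looks like the sketch below: the reverse index is updated as each document is indexed, so it is always complete and searchable with no separate merge pass. All names are illustrative; a real implementation would write these postings to the on-disk database (costing the extra disk transactions mentioned above).

```python
# Sketch of updating a word -> documents reverse index on the fly.
# Each document indexed updates the postings immediately, so the index
# is searchable at all times and no merge phase is needed.
from collections import defaultdict

index = defaultdict(set)  # word -> set of document IDs

def index_document(doc_id, words):
    for w in words:
        index[w].add(doc_id)  # in a real store, one disk update per posting

index_document(1, ["fast", "search"])
index_document(2, ["fast", "merge"])
```

Parallel indexers could each apply such updates to their own subset and combine the per-word posting sets at the end.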
This program requires little in the way of speed improvement. A welcome addition, though, would be to send one e-mail per address, by keeping a dictionary of pages keyed by address.
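The dictionary in question is simply a grouping of pages under each maintainer's address, so one message can list all of that maintainer's pages. A minimal sketch (the data layout is an assumption):

```python
# Sketch: group notification pages by maintainer address so htnotify can
# send a single e-mail per address instead of one per page.
from collections import defaultdict

pages_by_address = defaultdict(list)
for address, url in [
    ("webmaster@example.com", "http://example.com/a.html"),
    ("webmaster@example.com", "http://example.com/b.html"),
    ("jane@example.org", "http://example.org/"),
]:
    pages_by_address[address].append(url)

# One message is then composed per dictionary key, listing every URL in its list.
```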
Perhaps the biggest speed improvement for htfuzzy would be an n-gram fuzzy match. This would require an additional database of n-grams (probably trigrams) and words. This would be a nice fuzzy in and of itself, but it would also speed up the substring match. If the database is present, the substring match need only check words that contain all of the n-grams in the request. This would cut down the search set considerably.
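The filtering step described above can be sketched as follows: build a trigram-to-words database once, then intersect the posting sets for the query's trigrams to get the candidate words a substring match must actually scan. The structure and names here are illustrative, not htfuzzy's.

```python
# Sketch of trigram filtering for the substring match: a candidate word
# must contain every trigram of the query, which shrinks the search set.
from collections import defaultdict

def trigrams(word):
    return {word[i:i + 3] for i in range(len(word) - 2)}

words = ["search", "research", "merge", "searching"]

tri_index = defaultdict(set)  # trigram -> words containing it
for w in words:
    for t in trigrams(w):
        tri_index[t].add(w)

def candidates(query):
    """Words containing all trigrams of the query (superset of true matches)."""
    sets = [tri_index.get(t, set()) for t in trigrams(query)]
    return set.intersection(*sets) if sets else set(words)
```

The candidate set can still contain false positives (the trigrams may appear scattered rather than contiguously), so the real substring test runs afterward, but over far fewer words.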
This is perhaps the area of greatest possible improvement. Currently, we naively compile a list of all possible matches and sort the list by score (or eventually by other criteria). At best, this is O(n log n), which could take a considerable amount of time for large result sets.
Common sense and user logging indicate that most people only consider the first few matches, perhaps only the first page or two. So why do we gather matches with very low scores? Beyond that, why do we sort matches we never use in results?
One improvement would be to sort only the matches we know we need. The simplest way to implement this "partial sort," IMHO, is to place the matches in a heap. A heapsort can then pop off as many matches as necessary while spending little effort on the rest. For many result sets, this should be a very big speedup. Other techniques, such as radix sorting, are possible for integer comparisons.
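The heap-based partial sort above can be sketched in a few lines; the scores and document names are invented for illustration. Pulling the top k of n matches this way costs roughly O(n log k) rather than the O(n log n) of a full sort.

```python
# Sketch of the partial sort: keep only the top-k matches in order rather
# than sorting the entire result set. heapq.nlargest does a single pass,
# maintaining a small heap of the current best k matches.
import heapq

matches = [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.7), ("d5", 0.1)]

# First "page" of results: the two highest-scoring matches, in order.
top2 = heapq.nlargest(2, matches, key=lambda m: m[1])
# top2 -> [("d2", 0.9), ("d4", 0.7)]
```

Fetching the next page means popping a few more matches off the heap; the bulk of low-scoring matches is never fully ordered.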
Other optimizations focus on returning matches with very high scores quickly. One source that I'm examining (I don't completely understand their algorithms yet) is "Computing Iceberg Queries Efficiently," which focuses on picking out high-weight matches efficiently. These techniques rely on sampling and other probabilistic methods, so they're probably not suited to small queries.