[htdig3-dev] httools Concept


Geoff Hutchison (ghutchis@wso.williams.edu)
Wed, 12 May 1999 13:51:22 -0400


So I've thought about Gilles' idea for htdump/htload tools. I was
wrong--these are very good ideas. The former should be pretty simple--just
call the DocumentDB::CreateSearchDB() method. That method will need to be
expanded, but it's already there. The latter would need, say, a
DocumentDB::ExtractSearchDB() method.

In essence, the code would be extremely minimal and open up all sorts of
possibilities. People who wish to move databases from a big-endian to a
little-endian machine could do so. People could potentially edit their
databases by hand. In the future, people could move databases from Berkeley
DB to SQL to whatever.
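
As a rough illustration of why a text dump sidesteps byte order, here is a minimal sketch of a dump/load round trip. Everything in it is an assumption made for the example: the DocRecord struct, its three fields, the tab-separated layout, and the dumpRecords()/loadRecords() names are hypothetical and are not the existing DocumentDB interface.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for illustration; a real document record has more fields.
struct DocRecord {
    int id;
    long size;
    std::string url;
};

// Write one record per line as tab-separated decimal text.
// Numbers written as text carry no byte order, so the file is portable.
void dumpRecords(const std::vector<DocRecord>& docs, std::ostream& out) {
    for (const DocRecord& d : docs) {
        // Assumption: URLs never contain tabs or newlines.
        assert(d.url.find_first_of("\t\n") == std::string::npos);
        out << d.id << '\t' << d.size << '\t' << d.url << '\n';
    }
}

// Read records back from the text form; malformed lines are skipped.
std::vector<DocRecord> loadRecords(std::istream& in) {
    std::vector<DocRecord> docs;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        DocRecord d;
        if (fields >> d.id >> d.size && fields.get() == '\t' &&
            std::getline(fields, d.url)) {
            docs.push_back(d);
        }
    }
    return docs;
}
```

Because each line is ordinary text, the same file could be edited by hand or fed to a different backend, which is the flexibility described above.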

Furthermore, it seems like htmerge will soon fall into the category of a
database tool. If we hit our goal of allowing digging on active databases,
htmerge would primarily serve to merge separate databases together, right?

Finally, there's the question of 'pruning' a database. With the revised
DocumentDB code Hans-Peter put in, the document side of htmerge just
removes dead documents. This is an important question for continual
digging: how do you delete documents? We can easily remove the single
record in the DocDB, but documents obviously have multiple word records.

Do we delete by looping through the word DB and removing words in dead
documents? Or do we simply block access to dead documents (so they don't
come up in searches) and have a separate phase to remove them, similar to
htmerge now?
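
To make the two options concrete, here is a small sketch using an in-memory stand-in for the word DB. The WordIndex type, the set of dead document IDs, and the search()/prune() functions are all hypothetical names invented for this example; the real word database is on disk and structured differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the word DB: word -> IDs of documents containing it.
typedef std::map<std::string, std::vector<int> > WordIndex;

// "Block access": a search skips documents marked dead, leaving their
// word records in place until a later prune.
std::vector<int> search(const WordIndex& index, const std::set<int>& dead,
                        const std::string& word) {
    std::vector<int> hits;
    WordIndex::const_iterator it = index.find(word);
    if (it == index.end()) return hits;
    for (int id : it->second) {
        if (dead.count(id) == 0) hits.push_back(id);
    }
    return hits;
}

// "Loop through the word DB": visit every word and drop records that point
// at dead documents. Returns the number of word records removed.
std::size_t prune(WordIndex& index, std::set<int>& dead) {
    std::size_t removed = 0;
    for (WordIndex::iterator it = index.begin(); it != index.end();) {
        std::vector<int>& ids = it->second;
        const std::size_t before = ids.size();
        ids.erase(std::remove_if(ids.begin(), ids.end(),
                                 [&dead](int id) { return dead.count(id) != 0; }),
                  ids.end());
        removed += before - ids.size();
        if (ids.empty()) {
            it = index.erase(it);  // no live documents left for this word
        } else {
            ++it;
        }
    }
    dead.clear();  // nothing refers to the dead IDs any more
    return removed;
}
```

In this sketch, marking a document dead costs one set insertion and the filter adds a lookup per hit, while prune() has to touch every word. That is the trade-off behind deferring removal to a separate phase.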

In short, I could see a few, possibly separate C++ tools:
 * htdump - dump the current DocDB and/or WordDB
 * htload - load a DocDB and/or WordDB
 * htmerge - merge two (or more) databases together
 * (htprune) - prune dead documents from the database

Does this sound like a good idea?
-Geoff



