Fri, 24 Sep 1999 12:38:51 +0200 (MEST)
I changed the WordList, rewrote part of the WordReference class
and introduced the new WordKey class. This was the core of the modification.
The idea behind the WordKey class is the following.
The key in the word database is made of a word + integers carrying
information. In the former format written by Geoff, it was word + document
id. In the new format it's word + document id + flags + location.
Why would we want to do that? Mainly because it allows us to
sort the word occurrence list using multiple criteria:
word ascending, then for each occurrence of the same word by
document id ascending, then for each occurrence of the word in
the same document, group together the words that have the same flags,
then sort them ascending according to their location in the document.
It is then quite easy to find the occurrences of the words that are
in the same document. It is even easier to find the word that occurs after a
given location in a given document (think phrase search).
I did not modify the search mechanism to take advantage of this
key structure (yet).
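To illustrate the ordering described above, here is a minimal sketch. It is not the actual WordKey code: the Key struct, its field types and the next_occurrence helper are names made up for this example, and a sorted in-memory vector stands in for the word database. Note that because flags sort before location, the lookup stays within one flags group.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the key described above; not the real WordKey class.
struct Key {
    std::string word;
    unsigned docid;
    unsigned flags;
    unsigned location;
};

// Order by word, then document id, then flags, then location (all ascending).
bool operator<(const Key& a, const Key& b) {
    return std::tie(a.word, a.docid, a.flags, a.location)
         < std::tie(b.word, b.docid, b.flags, b.location);
}

// Return the first occurrence of `word` in document `docid`, within the same
// flags group, located strictly after `location`; nullptr if there is none.
// `sorted` must already be sorted with the operator< defined above.
const Key* next_occurrence(const std::vector<Key>& sorted,
                           const std::string& word,
                           unsigned docid, unsigned flags, unsigned location) {
    Key probe{word, docid, flags, location};
    // upper_bound finds the first key strictly greater than the probe.
    auto it = std::upper_bound(sorted.begin(), sorted.end(), probe);
    if (it != sorted.end() && it->word == word &&
        it->docid == docid && it->flags == flags) {
        return &*it;
    }
    return nullptr;
}
```

With the keys sorted this way, all occurrences of a word in one document are contiguous, so a phrase search only has to step from one key to its neighbour.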
Encapsulating all that in a class (WordKey) makes it
relatively transparent to the application. I designed the WordKey class
so that it can be generated based on an ascii specification of the key
structure. If we want (afterwards) to add new fields to the search key,
it will be an easy task, as far as the WordKey class is concerned.
For various reasons too long to explain here (but if someone is
interested I will) it is very important that the WordKey class is structured
In order to make that work properly and write the regression
tests, I had to debug and clean up a large number of things in very basic
. DB2_db + Database (DB2_hash does not exist anymore)
The interface is simpler and has hooks for prefix and
. Dictionary, Configuration classes were modified to
use 'const' where appropriate.
. Configuration operator  and Find now return a String
instead of a char*. This is much more secure. Some
pieces of code were dangerously returning the content
of a static String to prevent deallocation. Most of
the tests dealing with configuration parameters had to
test for both a null pointer and an empty string. Worse, some did
not check for a null string, a source of ugly core dumps.
. String was modified and enhanced for 'const' and
conversion (cast + as_double). Because it's very confusing
and error-prone to do the following:
String foo = fct();
the operator int() aborts with an error message
that says: either use as_integer or the new empty()
method, which says yes if the length of the string is 0.
. Other classes were modified slightly for constness and
use of String instead of char*.
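As an illustration of why returning a string by value is safer than returning a char*, here is a minimal sketch. It does not use the ht://Dig classes: std::string stands in for String, and the Config class, its Add method and the free function as_integer are hypothetical names for this example only. A missing attribute yields an empty string, so callers have a single case to test with empty(), and the integer conversion is spelled out explicitly.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical stand-in for a configuration class; not the ht://Dig API.
class Config {
public:
    void Add(const std::string& name, const std::string& value) {
        values_[name] = value;
    }

    // Returns a copy of the value. A missing attribute yields an empty
    // string, never a null pointer, and no static buffer is involved.
    std::string Find(const std::string& name) const {
        auto it = values_.find(name);
        return it == values_.end() ? std::string() : it->second;
    }

private:
    std::map<std::string, std::string> values_;
};

// Explicit conversion in the spirit of as_integer, written here as a free
// function (an assumption), with a caller-supplied default for empty input.
int as_integer(const std::string& s, int fallback = 0) {
    return s.empty() ? fallback : std::atoi(s.c_str());
}
```

A caller then writes Find(name).empty() to ask whether an attribute is set, and as_integer(Find(name)) when a number is wanted, instead of relying on an implicit conversion.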
I commented all the changes in detail in the ChangeLog. The fact
is that I've modified a *lot* of things and that since the tests are not
complete yet, I'm not 100% sure I did not break anything. I'm going to
spend the next few days adding tests and running all that through Purify.
Don't hesitate to bash me if something is going wrong.
In addition, I've made a script (htdoc/cf_generate.pl) that
generates attrs.html, cf_byprog.html and cf_byname.html from the
htlib/defaults.cc file. For that I've changed the structure of the
configuration defaults to add fields containing the needed information.
This will make it a *lot* easier to document attributes. I did that
because I spent too much time adding the word_dump attribute last time
and noticed that around 10 attributes were not documented.
-- Loic Dachary
ECILA 100 av. du Gal Leclerc 93500 Pantin - France Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61 e-mail: Loic@Dachary.org URL: http://www.senga.org/