Geoff Hutchison (ghutchis@wso.williams.edu)
Mon, 10 May 1999 16:44:39 -0400 (EDT)
On Mon, 10 May 1999, Gilles Detillieux wrote:
> people who modify the database handling, or introduce new data encodings
> or compression, are not necessarily fluent in Perl, and those that are
> fluent in Perl aren't necessarily in tune with what's changed. A C/C++
> API could go along for the ride more easily.
True, but technically we already have a C++ API. It's documented in
htcommon/DocumentDB.[h,cc] and htcommon/DocumentRef.[h,cc]. You're
partially right that those fluent in Perl aren't necessarily in tune
with what's changed. But it's also hard to hit a quickly moving target.
I think that the whole question will be solved if we solidify
the database backend. Right now we're in a bit of flux as far as that
goes. We just moved from GDBM to Berkeley DB. We moved through a variety
of database-related bugs to breaking compatibility with 3.1.x
databases with elements of the 3.2 code. Yet I foresee a plateau as far
as database changes go. People want other formats like SQL, but we
*should* be able to find a common data format that can be backward
compatible. I think we're about set on the DocDB side, but I don't
have the free time to sit down and even crunch out the htdig side of
the new WordDB. Hopefully I'll really have that week free in two weeks.
> E.g. to upgrade from 3.2 to 3.3, use the 3.2 htdump tool to extract the
> database records, then use the 3.3 htload tool to build your new DB.
> All the more reason to include these tools as part of each and every
> release.
I would like to suggest that these not even be required. Useful,
yes, especially if someone decides to switch from Berkeley to SQL. But
ignoring the change of interface code in 3.1, the format is
essentially the same as 3.0. The DocumentRef code is very flexible
about allowing additional fields--I added the DocMetaDescription field
and it was backwards-compatible. No need for htdump; everything reads
it without problems.
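To illustrate why an added field need not break older readers, here is a
minimal sketch of a tagged record format in which each field carries its
own tag and length, so a reader can skip fields it does not recognize.
The tag values, the DocRecord struct, and the appendField/decodeRecord
helpers are hypothetical and are not the actual DocumentRef code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical field tags; these are NOT the real ht://Dig identifiers.
enum FieldTag : std::uint8_t {
    kTitle = 1,
    kUrl = 2,
    kMetaDescription = 3  // stands in for a field added in a later release
};

struct DocRecord {
    std::string title;
    std::string url;
    std::string metaDescription;  // stays empty when the field is absent or skipped
};

// Append one field as: 1-byte tag, 2-byte big-endian length, payload.
void appendField(std::string &out, std::uint8_t tag, const std::string &value) {
    assert(value.size() <= 0xFFFF);  // length must fit in two bytes
    out.push_back(static_cast<char>(tag));
    out.push_back(static_cast<char>((value.size() >> 8) & 0xFF));
    out.push_back(static_cast<char>(value.size() & 0xFF));
    out += value;
}

// Decode a record. Tags above maxKnownTag are skipped using their length,
// which imitates an older reader that predates the newer fields.
// Returns false if the buffer is truncated.
bool decodeRecord(const std::string &in, std::uint8_t maxKnownTag, DocRecord &rec) {
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in.size() - pos < 3) return false;  // incomplete field header
        const std::uint8_t tag = static_cast<std::uint8_t>(in[pos]);
        const std::size_t len =
            (static_cast<std::size_t>(static_cast<std::uint8_t>(in[pos + 1])) << 8) |
            static_cast<std::size_t>(static_cast<std::uint8_t>(in[pos + 2]));
        pos += 3;
        if (in.size() - pos < len) return false;  // incomplete payload
        const std::string value = in.substr(pos, len);
        pos += len;
        if (tag > maxKnownTag) continue;  // unknown field: skip it
        switch (tag) {
            case kTitle: rec.title = value; break;
            case kUrl: rec.url = value; break;
            case kMetaDescription: rec.metaDescription = value; break;
            default: break;
        }
    }
    return true;
}
```

Calling decodeRecord with maxKnownTag set to kUrl plays the part of an
older reader: it still gets the title and URL from a record that also
contains a meta description. A newer reader given a record without that
field simply leaves metaDescription empty.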
> How about using the upcoming regex search option: /.*/
> What goes into LOGICAL_WORDS in this case is a matter we'll need to decide
> on regardless, when implementing regex searching.
I think this is a bad idea. This would perform the search by adding
every word in the database (and ignoring regex_max_matches) and then
looking each one up in the document db, yielding all documents. The
performance would be slow, at best.
We can use that *pattern* as a keyword for "all matches" but we should
escape the search parser at that point. We can efficiently retrieve
all DocumentRefs from the database through other means.
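A rough sketch of the difference, using an in-memory stand-in for the
two databases. The Index struct, the kMatchAll constant, and the function
names are assumptions made up for this illustration; the real code would
walk the document database rather than a std::map.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using DocId = int;

// Hypothetical stand-in for the word database and the document database.
struct Index {
    std::map<std::string, std::set<DocId>> words;  // word -> documents containing it
    std::map<DocId, std::string> docs;             // document id -> URL
};

// The pattern treated as a keyword for "all matches".
const std::string kMatchAll = "/.*/";

// Slow route: expand the pattern to every word, then collect each word's
// documents. The work grows with the number of word entries, and a document
// with no indexed words is never found.
std::vector<DocId> allDocsViaWords(const Index &idx) {
    std::set<DocId> seen;
    for (const auto &entry : idx.words)
        seen.insert(entry.second.begin(), entry.second.end());
    return std::vector<DocId>(seen.begin(), seen.end());
}

// Direct route: walk the document table once and skip the word lookups.
std::vector<DocId> allDocsDirect(const Index &idx) {
    std::vector<DocId> ids;
    ids.reserve(idx.docs.size());
    for (const auto &entry : idx.docs) ids.push_back(entry.first);
    return ids;
}

// Check for the keyword before any query parsing would happen.
std::vector<DocId> search(const Index &idx, const std::string &query) {
    if (query == kMatchAll) return allDocsDirect(idx);
    const auto it = idx.words.find(query);
    if (it == idx.words.end()) return {};
    return std::vector<DocId>(it->second.begin(), it->second.end());
}
```

The direct route touches each document once, while the word-based route
visits every word entry and merges the overlapping document lists.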
> by date or title. Scoring for more complex regex will be more
> complicated, but I think that more exact matches should give a higher
> score.
Whoa. Now you're talking about approximate regex matching. Right now, that
fuzzy algorithm will match words that *exactly match* the regex. No more, no less.
Let's leave the question of weighting fuzzy algorithms for another time.
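As a small illustration of all-or-nothing matching, the sketch below
expands a pattern into the list of words it matches, with no notion of a
closer or weaker match. It uses std::regex from the C++ standard library
rather than whatever regex code the fuzzy algorithm uses, and it assumes
the whole word must match the pattern, which is one reading of "exactly
match". The expandRegex name is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the words that match the pattern in full. A word either matches
// or it does not; there is no partial credit and no per-word score.
std::vector<std::string> expandRegex(const std::vector<std::string> &words,
                                     const std::string &pattern) {
    const std::regex re(pattern);
    std::vector<std::string> matched;
    for (const auto &word : words) {
        // std::regex_match requires the entire word to match, unlike
        // std::regex_search, which accepts a match anywhere inside it.
        if (std::regex_match(word, re)) matched.push_back(word);
    }
    return matched;
}
```

With the words dig, digs, digger and htdig, the pattern "dig.*" selects
the first three and leaves out htdig, because the match has to start at
the beginning of the word.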
-Geoff
This archive was generated by hypermail 2.0b3 on Mon May 10 1999 - 13:54:28 PDT