[htdig3-dev] Re: [htdig] Search for new pages


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Mon, 10 May 1999 13:18:13 -0500 (CDT)


According to Geoff Hutchison:
> At 11:55 AM -0400 5/10/99, Gilles Detillieux wrote:
> >I think rather than trying to duplicate the whole C/C++ database code in
> >Perl, just to make up for a few missing features in the main ht://Dig
> >code, we should put the missing features in our primary tool set, so
> >that they go along for the evolutionary ride.
>
> While certain features would be useful to add to the C++ code, I don't
> think we should just scrap the Perl code. In particular, there have been
> numerous requests to create an API that would let people access the data
> using any number of languages.

Would that API be in the form of C++ code, or Perl code? In either case,
for the API to be useful, it will have to keep up with the evolution of
the database. I think part of the problem with the Perl code is that
people who modify the database handling, or introduce new data encodings
or compression, are not necessarily fluent in Perl, and those that are
fluent in Perl aren't necessarily in tune with what's changed. A C/C++
API could go along for the ride more easily.

> It is not easy to keep multiple codebases in sync, but the core problem is
> that the database formats are changing. But this isn't just a problem for
> writing scripts, it's also a problem for upgrading. We cannot simply change
> the database code arbitrarily. This has been my desire for the 3.2
> code--that we establish a database backend that fixes past problems and
> gives us the flexibility to expand without losing backwards-compatibility.
> As it stands, we'll probably need some utilities to convert old databases.
> Some of us can afford to rebuild from scratch. Someone with 1.6 million
> documents cannot.

Having tools to dump and reload the whole database into some sort of
consistent intermediate format would probably make this task easier.
E.g. to upgrade from 3.2 to 3.3, use the 3.2 htdump tool to extract the
database records, then use the 3.3 htload tool to build your new DB.
All the more reason to include these tools as part of each and every
release.
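
A minimal sketch of what such a dump/reload round trip might look like,
written in Python only for brevity. This is not ht://Dig code: the record
fields (url, title, mtime) and the tab-separated text format are
assumptions made up for illustration, and the names dump_records and
load_records are hypothetical. The point is only that a plain, escaped
text format can be written by one release and read by the next.

```python
import io

# Hypothetical record fields; a real document record would carry more.
FIELDS = ("url", "title", "mtime")

def _escape(value):
    # Backslash first, so the escapes added below are not escaped again.
    return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

def _unescape(value):
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "\\")
            out.append({"t": "\t", "n": "\n"}.get(nxt, nxt))
        else:
            out.append(ch)
    return "".join(out)

def dump_records(records, out):
    """Write a header line, then one tab-separated line per record."""
    out.write("\t".join(FIELDS) + "\n")
    for rec in records:
        out.write("\t".join(_escape(str(rec[f])) for f in FIELDS) + "\n")

def load_records(src):
    """Yield records as dicts, using the header line for field names."""
    header = src.readline().rstrip("\n").split("\t")
    for line in src:
        values = [_unescape(v) for v in line.rstrip("\n").split("\t")]
        yield dict(zip(header, values))

if __name__ == "__main__":
    buf = io.StringIO()
    dump_records([{"url": "http://example.com/", "title": "Home", "mtime": "926359093"}], buf)
    buf.seek(0)
    print(list(load_records(buf)))
```

Because the header line names the fields, a newer loader could skip
fields it does not know or fill in defaults for ones that are missing.
All values come back as strings in this sketch.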

> >Two frequent requests that seem to get punted over to the Perl code are
> >a complete dump of the database contents, and a "what's new" feature.
>
> They're "punted" because these features have been previously implemented in
> those scripts. As for a dump of the contents, you can also do that with the
> -t option to htdig.

That option is fine if you plan ahead to get a dump while digging, but a
lot of times you'd need it after the fact. I think there's a need for
a tool to get the current contents of your database.

> >With the date range selection going into htsearch (if we manage to
> >revive that semi-complete addition), it seems these two problems could
> >be solved by the same tool. All that's lacking is a "match everything"
>
> Part of my problem with a "match everything" is deciding what you do with
> the results. Do you score the documents? Using what? What do you fill in
> for various template variables: $(WORDS) $(LOGICAL_WORDS), etc. How does
> the user call this?

How about using the upcoming regex search option: /.*/
What goes into LOGICAL_WORDS in this case is a matter we'll need to decide
on regardless, when implementing regex searching.

I don't think the score will be particularly important in this case.
Give them all the same low score. The user will probably want to sort
by date or title. Scoring for more complex regex will be more complicated,
but I think that more exact matches should give a higher score.
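
To make the idea concrete, here is a rough sketch in Python (again, not
ht://Dig code). It assumes documents are simple dicts with hypothetical
"title", "text" and "modified" keys, uses Python's re module as a
stand-in for whatever regex engine htsearch ends up with, and gives
every hit the same low score as suggested above. An optional date
cutoff turns the same routine into a "what's new" listing.

```python
import re
from datetime import date

FLAT_SCORE = 1  # assumption: every match gets the same low score

def search(docs, pattern, since=None, sort_by="date"):
    """Return (score, doc) pairs for docs whose text matches pattern."""
    regex = re.compile(pattern)
    hits = []
    for doc in docs:
        if since is not None and doc["modified"] < since:
            continue  # older than the requested date cutoff
        if regex.search(doc["text"]):
            hits.append((FLAT_SCORE, doc))
    if sort_by == "date":
        # Newest first, since the score carries no information here.
        hits.sort(key=lambda hit: hit[1]["modified"], reverse=True)
    else:
        hits.sort(key=lambda hit: hit[1]["title"].lower())
    return hits

if __name__ == "__main__":
    docs = [
        {"title": "B page", "text": "beta", "modified": date(1999, 5, 1)},
        {"title": "a page", "text": "", "modified": date(1999, 5, 9)},
    ]
    for score, doc in search(docs, ".*", since=date(1999, 5, 1)):
        print(score, doc["modified"], doc["title"])
```

With the pattern ".*" every document matches, even one with empty text,
so the date cutoff and the sort order do all the work.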

> The backend side is pretty easy. Just get the DocDB to cough up all its
> members and filter it. We loop through the database in htmerge in several
> places. The filtering is already in place.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930