Geoff Hutchison (email@example.com)
Thu, 18 Mar 1999 12:22:10 -0500 (EST)
On Thu, 18 Mar 1999, Andrew Scherpbier wrote:
> This implies (this is going *way* off topic here...) that internally you need
> a generalized word source interface. Each document type *and* structure can
> have its own word-source which will convert whatever the document is into a
> stream of words with the right flags set. All of this would be really easy to
> do if you had: a) threads b) dynamic code loading, possibly with website
> specific/defined word-source code that gets downloaded on demand. This is why
> I like Java for a search engine... It can do all of that and still be
> portable, secure, and fast :-)
This is also done in the Isearch engine, also GPL-ed. As I said when I
stepped forward as htdig3 maintainer, I wanted to try to develop the
current C++ code as much as possible. There are certainly advantages to
rewriting some or all of the code, whether in Java or otherwise. And there
are certain advantages to using Java for a search engine, including some
of the builtin classes.
I just don't have the time personally to do a Java version. :-) If there
*is* a general consensus to move towards an htdig4, I'd gladly support it
with suggestions and whatever time I can devote.
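
As a rough illustration of the "generalized word source interface" idea
quoted above, here is a minimal C++ sketch. All names in it (Word,
WordFlags, WordSource, PlainTextSource, drain) are hypothetical and are not
taken from the ht://Dig or Isearch sources. It only shows the shape: each
document type supplies its own source, and the indexer consumes a stream of
words with flags set. The plain-text splitter is deliberately naive and
assumes that words are runs of alphanumeric characters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flag values marking where a word occurred.
enum WordFlags : unsigned { FLAG_BODY = 1u << 0, FLAG_TITLE = 1u << 1 };

// One word as delivered to the indexer (hypothetical layout).
struct Word {
    std::string text;   // lowercased word
    int position;       // ordinal position within the document
    unsigned flags;     // combination of WordFlags
};

// The generalized interface: one implementation per document type/structure.
class WordSource {
public:
    virtual ~WordSource() = default;
    // Fills `out` with the next word; returns false when the stream is done.
    virtual bool next(Word& out) = 0;
};

// Naive plain-text source (assumption: a word is a run of alphanumerics).
class PlainTextSource : public WordSource {
public:
    explicit PlainTextSource(std::string text, unsigned flags = FLAG_BODY)
        : text_(std::move(text)), flags_(flags) {}

    bool next(Word& out) override {
        // Skip separators before the next word.
        while (pos_ < text_.size() && !isWordChar(text_[pos_])) ++pos_;
        if (pos_ >= text_.size()) return false;

        const std::size_t start = pos_;
        while (pos_ < text_.size() && isWordChar(text_[pos_])) ++pos_;
        assert(pos_ > start);  // at least one word character was consumed

        out.text = text_.substr(start, pos_ - start);
        for (char& c : out.text)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        out.position = count_++;
        out.flags = flags_;
        return true;
    }

private:
    static bool isWordChar(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    }

    std::string text_;
    unsigned flags_;
    std::size_t pos_ = 0;
    int count_ = 0;
};

// The indexer side sees only the interface, never the document format.
std::vector<Word> drain(WordSource& source) {
    std::vector<Word> words;
    Word w{};
    while (source.next(w)) words.push_back(w);
    return words;
}
```

An HTML or PDF source would implement the same next() call and set
different flags. Threads and dynamic loading are left out of the sketch
entirely.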
> incorrect results. Maybe a better example would be something like
> "word-source" which would be entered into the database as "word", "source",
> and "word-source". What are the locations for those words, then?
Currently none of these enter the database! What actually makes it in is
"wordsource." Of course position need not be integer, but then we get into
> Should phrase searching look for punctuation that is supplied in the query?
> For example, would the phrase search "scherpbier, andrew" be rewritten to
> "scherpbier andrew"?
With current parsing, the comma would be dropped, since it's in
valid_punctuation. Looking for punctuation would require that it's stored
somehow in the database. A more interesting question IMHO is a search for
"ht://Dig" versus "htdig" or "GNU" v. "a gnu" ;-)
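
To make the two behaviours discussed above concrete, here is a small C++
sketch. It is not the actual htdig parsing code: the function names, the
Entry struct and the punctuation set passed in are all assumptions made for
illustration. strip_punctuation() mimics the current behaviour, where
characters in valid_punctuation are simply dropped ("word-source" becomes
"wordsource"). compound_entries() shows one possible answer to the location
question, namely storing the parts and the joined form at the same
position. That choice is only an assumption, not a settled design.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Drops every character of `token` that appears in `punctuation`.
// This mirrors the behaviour described above for valid_punctuation.
std::string strip_punctuation(const std::string& token,
                              const std::string& punctuation) {
    std::string out;
    for (char c : token) {
        if (punctuation.find(c) == std::string::npos) out += c;
    }
    return out;
}

// Hypothetical database entry: a word and its position in the document.
struct Entry {
    std::string word;
    int position;
};

// Alternative scheme (an assumption, not current behaviour): emit each
// part of a compound plus the joined form, all at the same position.
std::vector<Entry> compound_entries(const std::string& token,
                                    const std::string& punctuation,
                                    int position) {
    assert(position >= 0);
    std::vector<Entry> entries;
    std::string part;
    for (char c : token) {
        if (punctuation.find(c) != std::string::npos) {
            // A punctuation character ends the current part.
            if (!part.empty()) entries.push_back({part, position});
            part.clear();
        } else {
            part += c;
        }
    }
    if (!part.empty()) entries.push_back({part, position});

    // Only add the joined form when the token really was a compound.
    if (entries.size() > 1) {
        entries.push_back({strip_punctuation(token, punctuation), position});
    }
    return entries;
}
```

With a punctuation set of "-,", strip_punctuation("scherpbier,", "-,")
gives "scherpbier", and compound_entries("word-source", "-,", 7) gives
"word", "source" and "wordsource", all at position 7. Neither function
helps with "GNU" versus "a gnu", since case and stop words are a separate
problem.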
This archive was generated by hypermail 2.0b3 on Thu Mar 18 1999 - 09:40:56 PST