[htdig3-dev] Re: ht://dig leadership


loic@ceic.com
Thu, 15 Jul 1999 18:37:16 +0200 (MEST)


Geoff Hutchison writes:
>
> I'm actually very glad to hear from you. I've heard good things about
> Catalog and obviously we have many interests in common. I took a look at

 Glad to hear that. There is so much to do and so little time :-)

> It would be nice to share some of the webbase code (and/or URI), and
> maybe parts of mifluz and Text::Query::SQL. There's quite a bit of
> demand for a (my)SQL backend to ht://Dig, as well as parsing
> AltaVista-style queries. However, I've been up to my eyeballs in getting
> the database format changes ready and the new Transport code.

 I must confess that the first two contributions I envision are

 1) Switch to automake + libtool
 2) Use an SQL backend (roughly, the idea is to encapsulate the
    backend-specific things in a shared lib module and implement a
    DBI-like interface).

 I was pleased to see you already had that in the wish list.
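
 To make the "DBI-like interface" idea a bit more concrete, here is a
minimal sketch. It is written in Python only for brevity (ht://Dig itself
is C++), and every name in it (Backend, SqliteBackend, the start_urls
table) is hypothetical, not existing ht://Dig code. SQLite stands in for
whatever SQL engine would really be used, and "INSERT OR IGNORE" is
SQLite-specific syntax.

```python
import sqlite3
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical storage interface; not part of ht://Dig."""

    @abstractmethod
    def execute(self, sql, params=()):
        """Run one statement and return all result rows."""


class SqliteBackend(Backend):
    """One possible implementation, using SQLite as a stand-in engine."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)

    def execute(self, sql, params=()):
        cur = self.conn.execute(sql, params)
        rows = cur.fetchall()  # empty list for statements without results
        self.conn.commit()
        return rows


def make_store(db):
    # Starting-point URLs live in a table rather than a configuration file.
    db.execute("CREATE TABLE IF NOT EXISTS start_urls (url TEXT PRIMARY KEY)")


def add_start_url(db, url):
    # Adding the same URL twice is harmless (SQLite-specific syntax).
    db.execute("INSERT OR IGNORE INTO start_urls (url) VALUES (?)", (url,))
```

 The crawler code would only ever see Backend, so switching engines would
mean writing another subclass and loading it as a shared module.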

> I have no doubt that ht://Dig can handle millions of documents. There
> are several sites in that ballpark, plus several more around 500,000+
> documents. There are obvious problems with the size of the databases
> (many OS limit files to 2GB), but this is greatly eased in the 3.2
> codebase.

 Disclaimer: I may say stupid things because I didn't look at the code
carefully. It seems to me that a few factors effectively prevent a
large-scale crawler from being maintained:
      . The list of starting point URLs is in the configuration file.
        Our search engine has 150 000 starting point URLs, and that is
        hard to manage in a configuration file.
      . When the crawler updates URLs it does a network access for
        all of them. Let's say I have 10 million URLs: this is not really
        what I want. What I want is that a URL successfully fetched is
        not verified again for a week (configurable). Generally speaking
        I want to specify update strategies that depend on the URL status
        (loaded, not modified, not found). I even want to specify a
        different update strategy for every site, if appropriate (daily
        for newspapers, monthly for archives, etc.).
  Of course this (and many other things) depends on having a real
database in the back-end, not just a hash table.
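
 As an illustration of the update strategies described above, here is a
small sketch in Python. The names (UrlRecord, recheck_delay, is_due), the
example host names, and all delay values other than the one-week recheck
for loaded URLs are assumptions made up for the example, not anything
that exists in ht://Dig.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from urllib.parse import urlsplit

# Default recheck delay per URL status. Only the one-week value for
# "loaded" comes from the text above; the others are illustrative.
DEFAULT_DELAYS = {
    "loaded": timedelta(days=7),
    "not_modified": timedelta(days=14),
    "not_found": timedelta(days=30),
}

# Per-site overrides (hypothetical hosts): daily for a newspaper,
# monthly for an archive.
SITE_DELAYS = {
    "news.example.com": timedelta(days=1),
    "archive.example.com": timedelta(days=30),
}


@dataclass
class UrlRecord:
    url: str
    status: str            # "loaded", "not_modified" or "not_found"
    last_checked: datetime


def recheck_delay(record):
    """A site-specific delay wins over the status-based default."""
    host = urlsplit(record.url).hostname
    if host in SITE_DELAYS:
        return SITE_DELAYS[host]
    return DEFAULT_DELAYS[record.status]


def is_due(record, now):
    """True if the URL should be fetched again at time `now`."""
    return now - record.last_checked >= recheck_delay(record)
```

 With the records stored in a real database, the crawler could select
only the URLs that are due instead of doing a network access for all of
them.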

> Fortunately, I certainly don't see ht://Dig going the way of isearch or
> freewais--it was a bit touch-and-go last year before Andrew opened up
> the CVS tree and I took over. None of us want to see that repeated. ;-)

 Thank you for the information.

> I'm sure that's true. Right now, I'd prefer to work on it part-time,
> though I often accept contract jobs for improving ht://Dig. Personally,
> I'd prefer to focus on the maintainer aspects than the developer since I
> don't consider myself a very outstanding coder. That's just my current
> personal preference...

 You mean you would refuse a $60 000/year proposal ?-) Assume that the
company hires you to "continue working on ht://dig", does not assign
a project manager to you, and does not set deadlines. The only thing the
company does is pay you a salary, to make sure ht://dig will not suffer
because at some point you find a well-paid job that eats all your
energy. I strongly believe a project like ht://dig needs at least two or
three full-time, motivated computer geeks.

-- 
		Loic Dachary

ECILA 100 av. du Gal Leclerc 93500 Pantin - France Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61 e-mail: Loic@Dachary.org URL: http://www.senga.org/



