Re: htdig: proposal for data structure [disk usage]


Andrew Scherpbier (andrew@contigo.com)
Mon, 08 Jun 1998 23:46:35 +0000


Jason Moore wrote:
>
> On Mon, 8 Jun 1998 plucas@frost.com wrote:
>
> > Whilst I would certainly like to be able to do phrase searching the
> > most important element of any search engine to me has got to be speed.
>
> I agree.

It is still my goal to make/keep ht://Dig fast. Don't worry. :-)

>
> > In posts over the last few months I have noticed Andrew and several
> > others suggesting a move away from GDBM (hashed type?) databases to
> > btree or even SQL to improve performance.
> >
> > I see that your examples use a GDBM database and their "index will be
> > at least 3 times the size of the collected documents".

[snip]
 
> > Would a different type of database be able to achieve the same thing
> > more efficiently? With 2.5GB of data, a 7.5GB index would be a high
> > price to pay for a useful function.

Unfortunately, a SQL-based ht://Dig will have higher space requirements.
However, I know that you are digging nightly and basically need space for two
copies of the whole database so that searches can be done during the lengthy
indexing process. The second copy would no longer be required, since the
database will always be kept in a consistent state and searches can continue
while the robot runs around gathering data.
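
To make the trade-off concrete, here is a rough back-of-the-envelope sketch.
It uses the 2.5GB data size and the "3 times" index ratio quoted above; the
ratio of 4 for a SQL-backed index is purely a made-up number for
illustration, not a measurement, and the function name is hypothetical.

```python
def peak_disk_gb(data_gb, index_ratio, copies):
    """Peak disk needed for the index, given how many full copies must coexist."""
    return data_gb * index_ratio * copies


DATA_GB = 2.5  # size of the collected documents, from the thread

# Nightly dig into a fresh database while searches use the old one:
# two complete copies of the index exist at the same time.
two_copies = peak_disk_gb(DATA_GB, index_ratio=3, copies=2)

# Single database that stays consistent during the dig. The ratio of 4
# is an ASSUMED figure standing in for "higher space requirements".
one_copy = peak_disk_gb(DATA_GB, index_ratio=4, copies=1)

print(f"two copies at 3x: {two_copies} GB")
print(f"one copy at 4x:   {one_copy} GB")
```

Under these assumptions the single copy comes out ahead as long as its index
is less than twice the size of the old one.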

>
> From the mSQL FAQ: http://www.Hughes.com.au/library/msql1/faq.htm
>
> For each field in a table, mSQL will also store an additional flag byte.
> mSQL also stores an additional flag byte for each row of the table.
>
> This is extremely low overhead - don't know about other databases. I'm
> still trying to find some rules for GDBM's disk usage.

Did you also read that mSQL allocates the full declared size for every
varchar field, no matter how short the stored value is? That's pretty bad! :-)
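
A small sketch of what that means per row, combining the two flag-byte rules
quoted from the FAQ with full-size allocation of varchar fields. The table
layout (a 32-character word plus two 4-byte integers) and the average word
length of 7 are hypothetical values chosen only for illustration.

```python
def msql_row_bytes(declared_sizes):
    """Estimate bytes per row: declared field sizes, one flag byte per
    field, plus one flag byte for the row itself."""
    return sum(declared_sizes) + len(declared_sizes) + 1


# HYPOTHETICAL word-index table: word varchar(32), doc id int, position int.
# The 4-byte integer size is an assumption.
declared = [32, 4, 4]
stored = msql_row_bytes(declared)

# Payload actually needed if the average word is 7 characters (assumed).
needed = 7 + 4 + 4

print(f"stored per row:  {stored} bytes")
print(f"payload per row: {needed} bytes")
```

With these numbers the flag bytes cost only 4 bytes per row, while the unused
part of the varchar costs 25, so the varchar allocation dominates the
overhead.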

-- 
Andrew Scherpbier <andrew@contigo.com>
Contigo Software <http://www.contigo.com/>