RE: [htdig] indexing


Subject: RE: [htdig] indexing
From: David Schwartz (davids@webmaster.com)
Date: Tue Jan 11 2000 - 22:27:09 PST


> Yes, but I'm pretty confident you'd be upset with the performance.
> Remember that it's not like it can just decide a URL is relatively
> unimportant. It needs to know what URLs are already visited as well
> as those already in the queue. So if it writes out part of the URL
> list to disk, it'll have to check the disk file for every new link it
> comes across.

        Keep a 32-bit hash of each URL in memory, along with the byte offset
into the link file. On a hash collision, the full URL can be retrieved from
the file to confirm whether it is a real match. Odds are the hash entries
will take up much less space than the URLs, not just because they are smaller
but because their size is fixed, so they can be slabbed with zero allocation
overhead.

        DS

        PS: I optimize code for a living. That'll be $50. :)

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-unsubscribe@htdig.org
You will receive a message to confirm this.



This archive was generated by hypermail 2b28 : Tue Jan 11 2000 - 22:43:12 PST