Re: htdig: Sorting results on date (3)


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 17 Dec 1998 17:10:44 -0600 (CST)


According to Geoff Hutchison:
>
> At 4:04 PM -0500 12/16/98, Gilles Detillieux wrote:
> >This will be a problem for 3.1.0b3 as well, with or without my sort patch!
> >Geoff introduced some modifications to the score calculation (before the
> >sort) which require the DocTime(), DocLinks() and DocBackLinks() from the
> >DocumentRef record. This works fine if your search doesn't match a huge
> >number of documents, but if it does, look out!
>
> Ah, that's a good point. But what we can do is load the DocRef once, get
> the data we need and then delete the reference. This way we calculate the
> score and anything else needed for sorting, then we free the memory.

OK, here's where my ignorance of C++ becomes an obstacle. (I think I've
been managing reasonably well until now, surprisingly.)

In Display::buildMatchList(), the new score calculation code assigns
docDB[url] to thisRef each time through the loop. Each [] operator on
docDB causes a new DocumentRef to be allocated, and this may have many
kilobytes of strings in it. Assuming you don't include my sort patch,
which does the setRef, what happens to all these DocumentRefs? You don't
explicitly delete thisRef when done with it, but each pass through the
loop allocates a new one. Does C++ garbage collection reclaim the old,
unused objects before htsearch's memory usage climbs through the roof?
Would an explicit delete help? Of course, in this case my sort patch
would prevent garbage collection as all the references get stored in
the matches list.

Should my patch instead get a new DocumentRef and only copy the fields
I need? That would seem to prevent excessive memory usage. Then, even
if thisRef is explicitly deleted at the end of the loop, it wouldn't
be a problem for the alternate sort methods.
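
A minimal sketch of that idea, for illustration only. Standard C++ has
no garbage collector, so a heap-allocated object stays allocated until
it is explicitly deleted. Everything below is a hypothetical stand-in,
not the actual ht://Dig code: DocumentRef is reduced to four fields,
FakeDocDB mimics a docDB whose [] operator allocates a new DocumentRef
per lookup, and MatchInfo is an invented name for the small record that
keeps only the fields needed for scoring and sorting.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for ht://Dig's DocumentRef (the real class has more fields).
struct DocumentRef {
    std::time_t docTime;
    int docLinks;
    int docBackLinks;
    std::string docHead;   // the large string we do not want to keep around
    static int live;       // live-object counter, for illustration only

    DocumentRef(std::time_t t, int links, int backLinks, std::size_t headLen)
        : docTime(t), docLinks(links), docBackLinks(backLinks),
          docHead(headLen, 'x') { ++live; }
    DocumentRef(const DocumentRef&) = delete;
    DocumentRef& operator=(const DocumentRef&) = delete;
    ~DocumentRef() { --live; }
};
int DocumentRef::live = 0;

// Invented name: only the small fields needed for scoring and sorting.
struct MatchInfo {
    std::string url;
    std::time_t docTime;
    int docLinks;
    int docBackLinks;
};

// Hypothetical database: each lookup heap-allocates a fresh DocumentRef
// that the caller owns, as described for docDB[url] above.
class FakeDocDB {
public:
    void add(const std::string& url, std::time_t t, int links, int backLinks,
             std::size_t headLen) {
        records_[url] = Record{t, links, backLinks, headLen};
    }
    DocumentRef* operator[](const std::string& url) const {
        auto it = records_.find(url);
        if (it == records_.end()) return nullptr;
        const Record& r = it->second;
        return new DocumentRef(r.time, r.links, r.backLinks, r.headLen);
    }
private:
    struct Record { std::time_t time; int links; int backLinks; std::size_t headLen; };
    std::map<std::string, Record> records_;
};

// Load each reference once, copy the small fields, then free it.
std::vector<MatchInfo> buildMatchList(const FakeDocDB& db,
                                      const std::vector<std::string>& urls) {
    std::vector<MatchInfo> matches;
    for (const std::string& url : urls) {
        DocumentRef* ref = db[url];
        if (ref == nullptr) continue;   // URL not in the database
        matches.push_back(MatchInfo{url, ref->docTime, ref->docLinks,
                                    ref->docBackLinks});
        delete ref;   // without this, every pass would leak one DocumentRef
    }
    return matches;
}
```

With this shape the match list holds only a URL and three small values
per match, so the size of the document heads no longer affects how much
memory the list needs. In current C++ a std::unique_ptr could replace
the explicit delete.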

> >A quick fix, I think, would be to change String::allocate_space()
> >to delete and re-allocate the Data array if the space required
> >goes down by more than some value (e.g. 256 chars), then just set
> >the Strings in the DocumentRef record to 0, unless you need them,
> >in Display::buildMatchList(). That should greatly reduce htsearch's
> >memory requirements, but does nothing to speed up the fetching of all
> >that data you just end up throwing out again. Anyone have a better plan?
>
> See above. Changing allocate_space(), of course, reduces memory in htdig
> and htmerge too. This might not be a bad feature for the String class, but
> I don't know how much it would help.

It wouldn't help on its own, but it would allow big savings if you
then explicitly set the DocHead string to "". Another thing I noticed is
that the String class always allocates in powers of two. That means a
32769-character string would take 64K. That overhead adds up if you're
storing lots of DocHeads in memory, and your max_head_length is over 32K!
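
The arithmetic can be checked with a small sketch. Both function names
are made up for illustration, and this is not the actual
String::allocate_space() code: it assumes plain rounding up to the next
power of two, and it assumes the proposed shrink rule compares the
currently allocated size against the newly required size.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is >= needed.
// Assumes needed is small enough that the shift cannot overflow.
std::size_t roundUpToPowerOfTwo(std::size_t needed) {
    std::size_t capacity = 1;
    while (capacity < needed) capacity <<= 1;
    return capacity;
}

// Proposed rule (an assumption about how it would be applied): reallocate
// when the buffer is too small, or when it is larger than needed by more
// than shrinkThreshold characters.
bool shouldReallocate(std::size_t allocated, std::size_t needed,
                      std::size_t shrinkThreshold = 256) {
    if (needed > allocated) return true;
    return allocated - needed > shrinkThreshold;
}
```

Under these assumptions roundUpToPowerOfTwo(32769) is 65536, leaving
32767 characters unused, and a 65536-character buffer that only needs
to hold an empty string would be released and re-allocated.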

Also, regardless of memory usage, if you're storing large document heads
in the DBs, getting DocumentRefs for each of thousands of matches in
a search is going to really slow down the process! You'd almost need
a way to efficiently grab the DocTime and other small fields from the
docDB without fetching the big strings like the DocHead. Or should the
heads go into a separate DB?
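
One way to picture the separate-DB option, as a sketch only: keep the
small per-document fields in one store and the heads in another, so a
date sort never touches the big strings. The names SplitDocStore,
DocSummary and sortByDateNewestFirst are hypothetical, and the two
in-memory maps merely stand in for on-disk databases.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ctime>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical small record: everything needed for scoring and sorting.
struct DocSummary {
    std::time_t docTime;
    int docLinks;
    int docBackLinks;
};

// Two maps stand in for two separate databases keyed by URL.
class SplitDocStore {
public:
    void put(const std::string& url, const DocSummary& s, const std::string& head) {
        summaries_[url] = s;
        heads_[url] = head;
    }
    const DocSummary* summary(const std::string& url) const {
        auto it = summaries_.find(url);
        return it == summaries_.end() ? nullptr : &it->second;
    }
    const std::string* head(const std::string& url) const {
        ++headFetches_;   // count how often the big strings are requested
        auto it = heads_.find(url);
        return it == heads_.end() ? nullptr : &it->second;
    }
    std::size_t headFetches() const { return headFetches_; }
private:
    std::map<std::string, DocSummary> summaries_;
    std::map<std::string, std::string> heads_;
    mutable std::size_t headFetches_ = 0;
};

// Sort matching URLs newest first using only the summary store.
// URLs with no summary record are skipped.
std::vector<std::string> sortByDateNewestFirst(const SplitDocStore& store,
                                               const std::vector<std::string>& urls) {
    std::vector<std::pair<std::time_t, std::string>> keyed;
    for (const std::string& url : urls) {
        const DocSummary* s = store.summary(url);
        if (s != nullptr) keyed.emplace_back(s->docTime, url);
    }
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const std::pair<std::time_t, std::string>& a,
                        const std::pair<std::time_t, std::string>& b) {
                         return a.first > b.first;
                     });
    std::vector<std::string> sorted;
    for (const auto& entry : keyed) sorted.push_back(entry.second);
    return sorted;
}
```

The heads would then be fetched only for the handful of matches that
actually get displayed on a results page.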

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930