Re: htdig: Sorting results on date (3)


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 16 Dec 1998 15:04:16 -0600 (CST)


According to htDig user:
>
> I'm still working on my own fix for this date-sorting stuff :-)
>
> I'm working with an index of about 12000 pages. I want to sort them by
> date since it concerns the pages of a newspaper :-)
>
> So, I used Gilles' patch (in combination with snapshot 111598).

That snapshot wasn't long after 3.1.0b2 was released, so my second patch,
for b2, should work for this.

> > Memory: Real: 51M/122M act/tot Virtual: 45M/256M use/tot Free: 632K
> > PID USERNAME PRI NICE SIZE RES STATE TIME CPU COMMAND
> > 23979 user 42 0 47M 43M WAIT 0:12 17.90% htsearch
>
> Size.. 47M (!) ....... free...632K ... *wowie* This happened when I tried
> to retrieve ALL documents from the database. (12000). Htsearch isn't able
> to sort this much results on date.
> I think time_t (anyway, compareTime) is the problem.
> BTW, I pressed CTRL-C when it reached 47M... I'm sure htsearch would
> result in a core-dump otherwise too...

Wowie is right! The time_t stuff isn't the problem, though, as a time_t
is just a 4-byte integer. The problem is that to get the DocTime(),
I have to load the DocumentRef record for each document matched in the
search. Before applying my patch, htsearch only loaded the DocumentRef
records for the few documents it was displaying on the current page,
after the sort. It turns out these records are huge, mostly, I presume,
because they contain the entire document excerpt, which may be several
KB per document.

This will be a problem for 3.1.0b3 as well, with or without my sort patch!
Geoff introduced some modifications to the score calculation (before the
sort) which require the DocTime(), DocLinks() and DocBackLinks() from the
DocumentRef record. This works fine if your search doesn't match a huge
number of documents, but if it does, look out!

I haven't looked into the docDB code, but I think we're going to need a
way of loading just the few fields we need (time, links and backlinks,
maybe title or the first n characters of the title), so we can set the
score and do the sorting of a large number of matches, and load the
whole DocumentRef only for displayed matches.
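
A minimal sketch of that idea, not taken from the docDB code: SortKey
and page_ids() are hypothetical names, and the sketch assumes the slim
fields can be fetched without the excerpt. Only the keys are sorted;
the caller would then load a full DocumentRef for each returned id.

```cpp
#include <algorithm>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical slim record: only the fields needed for scoring and sorting.
struct SortKey {
    int doc_id;
    std::time_t time;   // modification time, seconds since the epoch
    int links;
    int backlinks;
};

// Sort the slim keys newest first and return the document ids that fall
// on the requested page (pages are numbered from 0).
std::vector<int> page_ids(std::vector<SortKey> keys,
                          std::size_t page, std::size_t per_page)
{
    std::sort(keys.begin(), keys.end(),
              [](const SortKey &a, const SortKey &b) { return a.time > b.time; });

    std::vector<int> ids;
    const std::size_t first = page * per_page;
    for (std::size_t i = first; i < keys.size() && i < first + per_page; ++i)
        ids.push_back(keys[i].doc_id);
    return ids;
}
```

With 12000 matches and 10 results per page, this keeps 12000 small
structs in memory but only 10 full records.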

> When I use htsearch off the prompt, it asks me a 'value for sort'. When I
> use 'date' and I search on something what should return about 40
> results, htsearch DOESN'T sort on date! (Does this have to do with the
> snapshot release I use?)

Maybe. I'm not familiar with that snapshot, so I can't tell you what my
patch did to it. If you have a copy of htsearch/Display.cc before and
after the patch, maybe you can send me a diff -u or diff -c of the two
to examine what's going on. It's a fairly simple patch, though, and
not much has changed in Display.cc that I know of.

> In Display.cc::sort:
>
> char str[80];
> ResultMatch **array = new ResultMatch*[numberOfMatches];
>
> + if (numberOfMatches>1000) numberOfMatches=1000;
>
> ----
>
> + for(j=0; j < numberOfMatches; j++)
> + {
> + array[j]->setRef(docDB[array[j]->getURL()]);
> + }
>
> matches->Release();
>
> qsort((char *) array, numberOfMatches, sizeof(ResultMatch *),
> Display::compare);

OK, I follow what you're doing here. However, if you arbitrarily cut
off everything after the first 1000 matches, before sorting, you may be
cutting out some of your best matches.
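
To illustrate the point, here is a hedged sketch that applies the limit
after ranking instead of before it. Scored and best_matches() are
hypothetical names, not part of htsearch, and the sketch says nothing
about the memory cost of obtaining the scores in the first place.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical match record: higher score means a better match.
struct Scored {
    int id;
    double score;
};

// Rank over the whole input, then keep only the `limit` best, best first.
std::vector<Scored> best_matches(std::vector<Scored> all, std::size_t limit)
{
    limit = std::min(limit, all.size());
    std::partial_sort(all.begin(),
                      all.begin() + static_cast<std::ptrdiff_t>(limit),
                      all.end(),
                      [](const Scored &a, const Scored &b) { return a.score > b.score; });
    all.resize(limit);
    return all;
}
```

Truncating the unsorted array to 1000 entries first would keep whichever
1000 matches happened to come first, not the 1000 best.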

> In Display.cc::compare:
>
> int
> Display::compare(const void *a1, const void *a2)
> {
> /* I use this to sort on date.. don't care about Scores or so...*/
> char buffer1[100];
> char buffer2[100];
>
> ResultMatch *m1 = *((ResultMatch **) a1);
> ResultMatch *m2 = *((ResultMatch **) a2);
>
> time_t t1 = m1->getRef()->DocTime();
> struct tm *tm1 = localtime(&t1);
> strftime(buffer1,sizeof(buffer1),"%Y%j",tm1);
> time_t t2 = m2->getRef()->DocTime();
> struct tm *tm2 = localtime(&t2);
>
> strftime(buffer2,sizeof(buffer2),"%Y%j",tm2);
>
> return (atol(buffer2)-atol(buffer1));
> }
>
> I know this is an ugly piece of code :-) Don't bother me with that!
>
> What I do here is (maybe stupid) as follows: I take century and
> number_of_day_of_the_year. (1998364 for example). Gilles' patch is
> better on this I suppose..

It's not so much the aesthetics of it that get to me as the fact that
it seems to introduce a lot of unneeded overhead. Just subtracting two
time_t values is as simple a time comparison as you can get. A time_t
is just a long int representing seconds elapsed since midnight (UTC),
Jan 1, 1970. Subtracting them gives a difference in seconds between
the two modification times.
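
A minimal sketch of such a comparison, with no localtime() or strftime()
calls. Match is a hypothetical stand-in for ResultMatch. One deliberate
difference from a plain subtraction: the sketch compares the two values
and returns -1, 0 or 1, on the assumption that the difference of two
time_t values might not fit in the int that qsort() expects.

```cpp
#include <cstdlib>
#include <ctime>

// Hypothetical stand-in for ResultMatch; only the timestamp matters here.
struct Match {
    std::time_t doc_time;   // modification time, seconds since the epoch
};

// qsort-style comparator over an array of Match pointers, newest first.
// Comparing instead of subtracting avoids truncating a large difference.
int compare_by_time(const void *a1, const void *a2)
{
    const Match *m1 = *static_cast<const Match *const *>(a1);
    const Match *m2 = *static_cast<const Match *const *>(a2);
    if (m1->doc_time < m2->doc_time) return 1;
    if (m1->doc_time > m2->doc_time) return -1;
    return 0;
}

// Sort an array of n Match pointers in place, newest first.
void sort_by_time(Match **array, std::size_t n)
{
    std::qsort(array, n, sizeof(Match *), compare_by_time);
}
```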

> Using the routine above, I'm able to QUICKLY sort about 1000 documents by
> date. Therefore, I have to build a limit in htsearch so that it can only
> display 1000 matches, even if it found 12000...
>
> The 1000-limit is because time_t eats to much mem (not really sure, but
> when I comment it out htsearch doesn't give me a core-dump)

It's not the time_t that eats up the mem. Your code fetches time_t
values just like mine does. The big difference is that you only get the
DocumentRef records from the first 1000 matches, rather than all 12000.

If we can work out a way of efficiently getting the bits of information we
need for those 12000 matches (i.e. quickly and without using up tons of
memory), then my sort patch and the new scoring enhancements in 3.1.0b3
should work well, even with 12000 matches.

A quick fix, I think, would be to change String::allocate_space()
to delete and re-allocate the Data array if the space required
goes down by more than some value (e.g. 256 chars), then just set
the Strings in the DocumentRef record to 0, unless you need them,
in Display::buildMatchList(). That should greatly reduce htsearch's
memory requirements, but does nothing to speed up the fetching of all
that data you just end up throwing out again. Anyone have a better plan?
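
A rough sketch of the shrinking behaviour described above. Buffer is a
hypothetical, simplified class and not ht://Dig's actual String; the
256-character threshold is the example value from the text, and the
sketch assumes old contents need not survive a re-allocation.

```cpp
#include <cstddef>
#include <cstring>

// Assumed threshold: shrink when more than this many chars would be wasted.
const std::size_t kShrinkThreshold = 256;

// Hypothetical simplified string buffer (not ht://Dig's String class).
class Buffer {
public:
    Buffer() : data_(nullptr), allocated_(0), length_(0) {}
    ~Buffer() { delete[] data_; }
    Buffer(const Buffer &) = delete;
    Buffer &operator=(const Buffer &) = delete;

    // Replace the contents with len bytes from s (len may be 0).
    void assign(const char *s, std::size_t len)
    {
        allocate_space(len);
        if (len > 0) std::memcpy(data_, s, len);
        length_ = len;
    }

    std::size_t length() const { return length_; }
    std::size_t allocated() const { return allocated_; }

private:
    // Make room for `needed` bytes. Re-allocates when the array is too
    // small, and also when it is larger than needed by more than the
    // threshold. Existing contents are discarded on re-allocation.
    void allocate_space(std::size_t needed)
    {
        const bool too_small = needed > allocated_;
        const bool too_big = allocated_ > needed + kShrinkThreshold;
        if (!too_small && !too_big) return;
        delete[] data_;
        data_ = needed ? new char[needed] : nullptr;
        allocated_ = needed;
    }

    char *data_;
    std::size_t allocated_;
    std::size_t length_;
};
```

With this behaviour, assigning an empty or short value to a field that
held a multi-KB excerpt releases most of its memory, while small
fluctuations in length do not trigger a re-allocation.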

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


