Re: [htdig] Multiple merging


Subject: Re: [htdig] Multiple merging
From: Andrea Carpani (ancarpan@vitaminic.net)
Date: Fri Nov 05 1999 - 08:19:13 PST


On 03-Nov-99 Gilles Detillieux wrote:

>> It's hard to know what's "normal" or which option would be faster.
>> Remember we're all digging very different servers, pages, etc. For
>> example, you don't mention how many URLs you have or the size of your
>> database.
>>
>> I'm guessing the merging is taking a while because either (or both):
>> a) 1200 sites => many, many URLs => large databases
>> b) the machine you're using doesn't have much RAM and is swapping to merge
>>
>> These are obviously intertwined. The amount of RAM you need is
>> related to the size of your databases...
>
> I'm wondering how Andrea is merging these 1200 separate databases.
> I don't know, but I'd guess that merging them hierarchically would be
> faster than merging them linearly. E.g., for 8 databases (1-8), you
> could merge 2-8 in turn into database 1, but it seems it would be more
> efficient to merge 2 into 1, 4 into 3, 6 into 5, 8 into 7, 3 into 1,
> 7 into 5, and finally 5 into 1. I'm guessing though. I don't know that
> anyone ever benchmarked it.
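
The two merge orders described above can be written out as a small sketch. This only generates the order of (source, target) merges; it does not call htmerge or any other ht://Dig tool. The cost figure rests on a purely hypothetical model in which each merge costs in proportion to the size of the combined result. Nothing in this thread confirms that model, and under a different one (for example, cost proportional to the source database only) the comparison could come out the other way.

```python
def linear_schedule(n):
    """Merge databases 2..n in turn into database 1."""
    return [(src, 1) for src in range(2, n + 1)]


def hierarchical_schedule(n):
    """Pairwise (tree-shaped) merge order as (source, target) pairs."""
    steps = []
    stride = 1
    while stride < n:
        # Targets are 1, 1 + 2*stride, ...; each absorbs the database
        # that sits `stride` positions after it, if there is one.
        for target in range(1, n + 1, 2 * stride):
            src = target + stride
            if src <= n:
                steps.append((src, target))
        stride *= 2
    return steps


def merge_cost(schedule, n):
    """Total cost under the ASSUMED model: every database starts with
    size 1 and each merge costs the size of the combined result."""
    sizes = {i: 1 for i in range(1, n + 1)}
    cost = 0
    for src, target in schedule:
        sizes[target] += sizes.pop(src)
        cost += sizes[target]
    return cost


if __name__ == "__main__":
    for name, make in (("linear", linear_schedule),
                       ("hierarchical", hierarchical_schedule)):
        schedule = make(8)
        print(name, schedule, merge_cost(schedule, 8))
```

For 8 databases the hierarchical schedule reproduces the order given above (2 into 1, 4 into 3, 6 into 5, 8 into 7, 3 into 1, 7 into 5, 5 into 1). Only a real benchmark on the actual databases would show which order is faster.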

I have to dig the sites separately because I need to know the "originator"
URL when I get a hit: the idea is that htdig uses the <url_list> attribute so
that I get a list of "derived" URLs for each "originator" URL. I then merge
all these small databases together. Is there an easier way to trace a hit
back to its originating URL?
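
The bookkeeping described above amounts to a reverse lookup from each derived URL to its originator. Here is a minimal sketch, assuming the per-site URL lists have already been read into a dictionary keyed by originator URL. The function name, the dictionary layout and the example.com URLs are all hypothetical; nothing here is part of ht://Dig.

```python
def build_origin_index(url_lists):
    """Map each derived URL to its originator URL.

    `url_lists` is an ASSUMED in-memory form of the per-site lists:
    {originator_url: [derived_url, ...]}.
    """
    index = {}
    for originator, derived_urls in url_lists.items():
        for url in derived_urls:
            # If a URL is reachable from several originators,
            # keep the first one seen.
            index.setdefault(url, originator)
    return index


if __name__ == "__main__":
    # Hypothetical data standing in for two small per-site databases.
    url_lists = {
        "http://a.example.com/": [
            "http://a.example.com/",
            "http://a.example.com/songs.html",
        ],
        "http://b.example.com/": [
            "http://b.example.com/",
            "http://b.example.com/bio.html",
        ],
    }
    index = build_origin_index(url_lists)
    print(index.get("http://b.example.com/bio.html"))
```

With an index like this, the originator of a hit is a single dictionary lookup on the hit's URL, whatever order the databases were merged in.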

----------------------------------
Andrea Carpani
E-Mail: <ancarpan@vitaminic.net>

Vitaminic -- The Music Evolution --
http://www.vitaminic.it
----------------------------------
