[htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate


Didier Gautheron (dgautheron@magic.fr)
Mon, 25 Jan 1999 01:07:58 +0000


* List: htdig3-dev@sob.htdig.org

Alexander Bergolth wrote:
>
> At 11:06 22.01.99 , Alexander Bergolth wrote:
> >db_dump -p wu.docs.index | dump-docs.pl > wu-index.1999-01-22
> >
> >sort wu-index.1999-01-22 > wu-index.1999-01-22-sorted
> >
> >wc -l wu-index.1999-01-22-sorted
> > 125273 wu-index.1999-01-22-sorted
> >
> >uniq -c wu-index.1999-01-22-sorted > wu-index.1999-01-22-uniq
> >
> >wc -l wu-index.1999-01-22-uniq
> > 78695 wu-index.1999-01-22-uniq
>
> Tonight I removed the old docs.index file before doing an initial dig and
> now the urls are unique:
>
> speth08:/<1>htdig/db > wc -l wu-index.1999-01-23-sorted
> 75849 wu-index.1999-01-23-sorted
> speth08:/<1>htdig/db > uniq -c wu-index.1999-01-23-sorted | wc -l
> 75849
>
> Looks like some old URLs are not deleted from this database...
>
> Btw. I noticed a significant speed decrease of the current CVS version in
> comparison to the CVS-tree from Dec 27th.
>
> The last initial dig on Jan 15th completed in 3:45 hours with a
> max_doc_size of 1MB; the current version took 4:51 hours to complete with a
> max_doc_size of 512k.
>
> I tried both versions several times and the run-time didn't vary by more than
> 10 minutes. There are currently no known or noticeable network problems. (We
> even changed the ATM interface yesterday.)
>
> Does anyone have similar experiences?
The problem is in HTML::parse: the skip_start stuff has to be declared
static and moved out of the loop.
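
A rough sketch of the kind of change meant here, not the actual ht://Dig
code: Matcher is a hypothetical stand-in for an expensive-to-build matcher
(it is not the StringMatch API), the "<!--noindex-->" marker is made up for
illustration, and count_slow/count_fast are invented names. It only shows
the general idea of building the object once instead of on every pass
through the loop.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a matcher that is costly to construct.
// This is NOT ht://Dig's StringMatch; it only counts constructions.
struct Matcher {
    static int builds;   // how many Matcher objects have been constructed
    std::string pattern;
    explicit Matcher(const std::string& p) : pattern(p) { ++builds; }
    bool found_in(const std::string& text) const {
        return text.find(pattern) != std::string::npos;
    }
};
int Matcher::builds = 0;

// Slow shape: the matcher is rebuilt on every loop iteration.
int count_slow(const std::vector<std::string>& lines) {
    int hits = 0;
    for (const std::string& line : lines) {
        Matcher skip_start("<!--noindex-->");   // constructed each time
        if (skip_start.found_in(line)) ++hits;
    }
    return hits;
}

// Faster shape: a function-local static is constructed once, on the
// first call, and reused by every later iteration and every later call.
int count_fast(const std::vector<std::string>& lines) {
    static const Matcher skip_start("<!--noindex-->");
    int hits = 0;
    for (const std::string& line : lines) {
        if (skip_start.found_in(line)) ++hits;
    }
    return hits;
}
```

With a function-local static the pattern is fixed at first use, so this
only fits if the pattern does not change during a run.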
Didier



This archive was generated by hypermail 2.0b3 on Thu Feb 04 1999 - 22:24:20 PST