Alexander Bergolth (leo@strike.wu-wien.ac.at)
Sat, 23 Jan 1999 12:28:00 +0100
* List: htdig3-dev@sob.htdig.org
At 11:06 22.01.99 , Alexander Bergolth wrote:
>db_dump -p wu.docs.index | dump-docs.pl > wu-index.1999-01-22
>
>sort wu-index.1999-01-22 > wu-index.1999-01-22-sorted
>
>wc -l wu-index.1999-01-22-sorted
> 125273 wu-index.1999-01-22-sorted
>
>uniq -c wu-index.1999-01-22-sorted > wu-index.1999-01-22-uniq
>
>wc -l wu-index.1999-01-22-uniq
> 78695 wu-index.1999-01-22-uniq
Last night I removed the old docs.index file before running an initial dig,
and now the URLs are unique:
speth08:/<1>htdig/db > wc -l wu-index.1999-01-23-sorted
75849 wu-index.1999-01-23-sorted
speth08:/<1>htdig/db > uniq -c wu-index.1999-01-23-sorted | wc -l
75849
It looks like old URLs are not deleted from docs.index if the file already
exists when the dig starts...
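
In case someone wants to check their own database the same way: with the
numbers from my earlier mail, 125273 - 78695 = 46578 entries were redundant.
The following is only a rough sketch for counting and listing them. It assumes
a sorted file with one URL per line, like the wu-index.*-sorted files above;
the function names are made up for illustration and are not part of ht://Dig.

```shell
#!/bin/sh
# Sketch only: expects an already sorted file, one URL per line.
# Function names are illustrative.

# Print the number of redundant lines (total lines minus distinct lines).
count_redundant() {
    total=$(wc -l < "$1")
    distinct=$(uniq "$1" | wc -l)
    echo $((total - distinct))
}

# Print every URL that occurs more than once, prefixed by its count,
# most frequent first.
list_duplicates() {
    uniq -c "$1" | awk '$1 > 1' | sort -rn
}
```

Calling `list_duplicates wu-index.1999-01-22-sorted | head` should show the
worst offenders first, which might hint at which URLs are left over from
earlier digs.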
Btw., I noticed that the current CVS version is significantly slower than
the CVS tree from Dec 27th.
The last initial dig on Jan 15th completed in 3:45 hours with a
max_doc_size of 1MB; the current version took 4:51 hours to complete with a
max_doc_size of 512k.
I tried both versions several times and the run time never varied by more
than 10 minutes. There are currently no known or noticeable network problems.
(We even changed the ATM interface yesterday.)
Has anyone else seen something similar?
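
To put the two runs on a comparable footing, here is a small sketch that
turns the htdig start/completed timestamps into elapsed seconds and a rough
documents-per-hour figure. The helper names are hypothetical, and I am
assuming the htmerge "Total documents" value is a fair document count for
each run. The two runs also used different max_doc_size settings, so the
figures are only indicative.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of ht://Dig.

# Convert a wall-clock time HH:MM:SS to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed seconds between two times on the same day (start, end).
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Integer documents per hour, given a document count and elapsed seconds.
docs_per_hour() {
    echo $(( $1 * 3600 / $2 ))
}

jan15=$(elapsed 03:07:00 06:52:07)
jan23=$(elapsed 03:07:00 07:58:21)
echo "Jan 15: $jan15 s, $(docs_per_hour 79804 "$jan15") docs/hour"
echo "Jan 23: $jan23 s, $(docs_per_hour 75865 "$jan23") docs/hour"
```

With the timestamps and totals from this mail, that works out to 13507 s and
about 21270 docs/hour for Jan 15 versus 17481 s and about 15623 docs/hour for
Jan 23.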
Fri Jan 15 03:07:00 MEZ 1999: htdig started, args: -t -i
Fri Jan 15 06:52:07 MEZ 1999: htdig completed
Fri Jan 15 07:29:44 MEZ 1999: htmerge completed
htdig: accounting.wu-wien.ac.at:80 411 documents
htdig: challenger.wu-wien.ac.at:80 66 documents
htdig: empire.wu-wien.ac.at:80 1183 documents
htdig: fgr.wu-wien.ac.at:80 286 documents
htdig: force.wu-wien.ac.at:80 355 documents
htdig: indi.wu-wien.ac.at:80 266 documents
htdig: miss.wu-wien.ac.at:80 16404 documents
htdig: wigeoweb.wu-wien.ac.at:80 86 documents
htdig: www.wu-wien.ac.at:80 59152 documents
htdig: wwwai.wu-wien.ac.at:80 3501 documents
htdig: wwwi.wu-wien.ac.at:80 6009 documents
htdig: zas.wu-wien.ac.at:80 60 documents
htmerge: Total documents: 79804
htmerge: Total doc db size (in K): 747715
-rw-rw-r-- 1 htdig harvest 198603776 Jan 15 07:29 /var/htdig/db/wu.docdb
-rw-rw-r-- 1 htdig harvest 128932544 Jan 15 06:51 /var/htdig/db/wu.docs
-rw-rw-r-- 1 htdig harvest 19973120 Jan 15 07:29 /var/htdig/db/wu.docs.index
-rw-rw-r-- 1 htdig harvest 303634377 Jan 15 07:23 /var/htdig/db/wu.wordlist
-rw-rw-r-- 1 htdig harvest 267509760 Jan 15 07:23 /var/htdig/db/wu.words.db
Sat Jan 23 03:07:00 MEZ 1999: htdig started, args: -t -i
Sat Jan 23 07:58:21 MEZ 1999: htdig completed
Sat Jan 23 08:30:03 MEZ 1999: htmerge completed
htdig: accounting.wu-wien.ac.at:80 412 documents
htdig: challenger.wu-wien.ac.at:80 66 documents
htdig: empire.wu-wien.ac.at:80 1188 documents
htdig: fgr.wu-wien.ac.at:80 296 documents
htdig: force.wu-wien.ac.at:80 338 documents
htdig: indi.wu-wien.ac.at:80 268 documents
htdig: miss.wu-wien.ac.at:80 11934 documents
htdig: wigeoweb.wu-wien.ac.at:80 83 documents
htdig: www.wu-wien.ac.at:80 60239 documents
htdig: wwwai.wu-wien.ac.at:80 3398 documents
htdig: wwwi.wu-wien.ac.at:80 6136 documents
htdig: zas.wu-wien.ac.at:80 60 documents
htmerge: Total documents: 75865
htmerge: Total doc db size (in K): 572179
-rw-rw-r-- 1 htdig harvest 173246464 Jan 23 08:29 /var/htdig/db/wu.docdb
-rw-rw-r-- 1 htdig harvest 119229661 Jan 23 07:58 /var/htdig/db/wu.docs
-rw-rw-r-- 1 htdig harvest 10503168 Jan 23 08:29 /var/htdig/db/wu.docs.index
-rw-rw-r-- 1 htdig harvest 286257559 Jan 23 08:25 /var/htdig/db/wu.wordlist
-rw-rw-r-- 1 htdig harvest 256347136 Jan 23 08:25 /var/htdig/db/wu.words.db
-----------------------------------------------------------------------
Alexander (Leo) Bergolth leo@leo.wu-wien.ac.at
WU-Wien - Zentrum fuer Informatikdienste http://leo.wu-wien.ac.at
Info Center
In a world without walls and fences, who needs windows and gates?