Alexander Bergolth (leo@strike.wu-wien.ac.at)
Fri, 22 Jan 1999 11:06:27 +0100 (MEZ) documents
* List: htdig3-dev@sob.htdig.org
On Wed, 20 Jan 1999, Geoff Hutchison wrote:
> At 6:11 PM -0400 1/20/99, Gilles Detillieux wrote:
>
> >A few trace prints in htmerge/docs.cc revealed the source of the 9 extra
> >documents. These were 9 documents that were disallowed by robots.txt,
> >which were deleted from the DB, because they had no DocHead, but because
> >of a missing "else", they were still indexed and counted. Here's the fix:
>
> I don't know if I believe it. That seemed to do it... After patching,
> recompiling and re-running htmerge, I get:
>
> htmerge: Total documents: 58193
> htmerge: Total doc db size (in K): 330586
>
> No complaints here. Leo, are you still seeing duplicate URLs?
Yes. :(
OK, maybe I did something wrong, so I'll explain the test procedure:
I'm using db_dump from Berkeley DB to print the contents of the docs.index
file and extract the URLs from that output using the following Perl script:
---------- snipp! ----------
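#!/usr/local/bin/perl
# Skip the db_dump header, which ends with the line "HEADER=END".
while ($_ ne "HEADER=END\n") {
    $_ = <>;
}
# Records follow as alternating key/data lines: the loop condition
# consumes the key line, then the data line is read and printed.
while (<>) {
    $_ = <>;
    print;
}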
---------- snipp! ----------
db_dump -p wu.docs.index | dump-docs.pl > wu-index.1999-01-22
sort wu-index.1999-01-22 > wu-index.1999-01-22-sorted
wc -l wu-index.1999-01-22-sorted
125273 wu-index.1999-01-22-sorted
uniq -c wu-index.1999-01-22-sorted > wu-index.1999-01-22-uniq
wc -l wu-index.1999-01-22-uniq
78695 wu-index.1999-01-22-uniq
:(
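To see which entries are actually repeated, rather than only counting them, something like the following should work. This is just a sketch: the function name list_duplicates is made up for illustration, and it assumes one extracted entry per line.

```shell
# list_duplicates (hypothetical helper): print each entry that occurs
# more than once, one copy per duplicated entry.
# Reads the named files, or standard input when no file is given.
list_duplicates() {
    # uniq only compares adjacent lines, so the input is sorted first.
    sort "$@" | uniq -d
}

# Example (file name as used in the commands above):
#   list_duplicates wu-index.1999-01-22 | head
```

Looking at a handful of the repeated entries may hint at whether they share a pattern, which the bare counts cannot show.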
- Leo -
P.S.: Multiple entries are distributed as follows:
  #docs  appearances
  60050            1
  12042            2
   2109            3
    587            4
    136            5
    146            6
    724            7
    906            8
   1431            9
    502           10
     52           11
      9           12
      1           13
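A distribution like the one above can be derived from the uniq -c output with a second counting pass. A minimal sketch, assuming the usual uniq -c layout (count first, then the entry); the function name appearance_histogram is hypothetical:

```shell
# appearance_histogram (hypothetical helper): read "uniq -c" output and
# print one line per multiplicity in the form "<#docs> <appearances>".
# Reads the named files, or standard input when no file is given.
appearance_histogram() {
    awk '{ print $1 }' "$@" |       # keep only the per-entry count
        sort -n |                   # group equal counts together
        uniq -c |                   # count how many entries share each count
        awk '{ print $1, $2 }'      # drop the padding uniq -c adds
}

# Example (file name as used in the commands above):
#   appearance_histogram wu-index.1999-01-22-uniq
```

As a cross-check, the #docs column should add up to the number of lines in the uniq file (78695), and the sum of #docs times appearances should equal the number of lines in the sorted file (125273). The figures above satisfy both.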
-----------------------------------------------------------------------
Alexander (Leo) Bergolth leo@leo.wu-wien.ac.at
WU-Wien - Zentrum fuer Informatikdienste http://leo.wu-wien.ac.at
Info Center
In a world without walls and fences, who needs windows and gates?