[htdig3-dev] Re: [htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate documents


Alexander Bergolth (leo@strike.wu-wien.ac.at)
Fri, 22 Jan 1999 11:06:27 +0100 (MEZ)


* List: htdig3-dev@sob.htdig.org

On Wed, 20 Jan 1999, Geoff Hutchison wrote:

> At 6:11 PM -0400 1/20/99, Gilles Detillieux wrote:
>
> >A few trace prints in htmerge/docs.cc revealed the source of the 9 extra
> >documents. These were 9 documents that were disallowed by robots.txt,
> >which were deleted from the DB, because they had no DocHead, but because
> >of a missing "else", they were still indexed and counted. Here's the fix:
>
> I don't know if I believe it. That seemed to do it... After patching,
> recompiling and re-running htmerge, I get:
>
> htmerge: Total documents: 58193
> htmerge: Total doc db size (in K): 330586
>
> No complaints here. Leo, are you still seeing duplicate URLs?

Yes. :(

OK, maybe I did something wrong, so I'll explain the test procedure:

I'm using db_dump from Berkeley DB to print the contents of the docs.index
file and extract the URLs from this output using the following Perl script:
---------- snipp! ----------
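#!/usr/local/bin/perl

# Skip the db_dump header, up to and including the "HEADER=END" line.
while (<>) {
  last if $_ eq "HEADER=END\n";
}

# Records follow as alternating key/data lines; print only the data line.
while (<>) {
  last if $_ eq "DATA=END\n";   # end-of-data marker, not a record
  $_= <>;
  print if defined;
}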
---------- snipp! ----------

db_dump -p wu.docs.index | dump-docs.pl > wu-index.1999-01-22

sort wu-index.1999-01-22 > wu-index.1999-01-22-sorted

wc -l wu-index.1999-01-22-sorted
  125273 wu-index.1999-01-22-sorted

uniq -c wu-index.1999-01-22-sorted > wu-index.1999-01-22-uniq

wc -l wu-index.1999-01-22-uniq
   78695 wu-index.1999-01-22-uniq

:(
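
As a cross-check of the two wc counts, the same figures can be produced in
one pass. This is only a sketch: the function name dup_summary is made up
for illustration and is not part of ht://Dig or Berkeley DB. It takes the
file of extracted entries (sorted or not) as its argument.

```shell
# dup_summary FILE
# Prints the total number of lines, the number of distinct lines, and the
# surplus (total - distinct), i.e. how many entries are repeats.
dup_summary() {
  sort "$1" | uniq -c | awk '
    { total += $1; distinct++ }   # $1 is the count that uniq -c prepends
    END { printf "total=%d distinct=%d surplus=%d\n", total, distinct, total - distinct }
  '
}
```

With the figures above (125273 lines, 78695 distinct), the surplus would be
125273 - 78695 = 46578 repeated entries.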

- Leo -

P.S.: Multiple entries are distributed as follows:

#docs  appearances
60050   1
12042   2
 2109   3
  587   4
  136   5
  146   6
  724   7
  906   8
 1431   9
  502  10
   52  11
    9  12
    1  13
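
A distribution like the one above can be derived from the uniq -c output.
The following is a sketch, not necessarily how these numbers were produced;
the function name appearance_histogram is made up for illustration.

```shell
# appearance_histogram FILE
# FILE is the output of `uniq -c` (each line: "count entry").
# Prints "#docs appearances" pairs, ordered by number of appearances.
appearance_histogram() {
  awk '{ print $1 }' "$1" |     # keep only the per-entry count
    sort -n |                   # group equal counts together
    uniq -c |                   # how many entries share each count
    awk '{ print $1, $2 }'      # normalize spacing: "#docs appearances"
}
```

Example: appearance_histogram wu-index.1999-01-22-uniq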

-----------------------------------------------------------------------
Alexander (Leo) Bergolth leo@leo.wu-wien.ac.at
WU-Wien - Zentrum fuer Informatikdienste http://leo.wu-wien.ac.at
Info Center
In a world without walls and fences, who needs windows and gates?

This archive was generated by hypermail 2.0b3 on Thu Feb 04 1999 - 22:24:20 PST