Re: htdig: Possible htsearch bug


George Adams (learningapache@my-dejanews.com)
Mon, 23 Nov 1998 09:17:02 -0700


> Do you have "remove_bad_urls" set? Since this is now
> a "bad url" it won't be removed unless this option
> is set.

Yes, I have added "remove_bad_urls: true" to my htdig.conf.

Let me clarify my setup, just in case it helps explain the problem better:

TEST #1
1) foo.html contains a link to bar.html. A search for a keyword which appears in bar.html (and nowhere else on the site) works as expected.

2) bar.html is deleted. foo.html now contains a link to a nonexistent file. When "rundig" is run, the missing file is noticed and removed, and a warning message about the "not found" file is generated.

TEST #2
1) foo.html again contains a link to bar.html. A search for a keyword in bar.html works as expected.

2) bar.html is deleted AND the link to bar.html is removed from foo.html. When "rundig" is rerun, no warnings are generated. However, the total number of indexed documents is now one fewer than it was before.

In both test cases, after step 2), searching for the keyword that used to appear in bar.html still produces a bogus search results screen:

     Documents 1 - 1 of 1 matches. More *'s
     indicate a better match.

followed by a blank page.

---------------

Again, I've found that blowing away the htdig/db directory before rerunning "rundig" fixes the problem.
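
For anyone who wants to script that workaround, here is a minimal sketch in Python. It is only an illustration: the function names are made up, the database directory path is whatever your installation uses, and the sketch assumes "rundig" is on the PATH and needs no arguments. It simply empties the directory and then runs the indexing command.

```python
import shutil
import subprocess
from pathlib import Path


def reset_db_dir(db_dir):
    """Delete db_dir with everything in it, then recreate it empty."""
    path = Path(db_dir)
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return path


def rebuild_index(db_dir, rundig_cmd=("rundig",)):
    """Empty the database directory, then run the indexing command.

    rundig_cmd is an assumption: adjust it to however rundig is
    invoked on your system.
    """
    reset_db_dir(db_dir)
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(list(rundig_cmd), check=True)
```

Calling rebuild_index() with the path to your htdig/db directory would do the same thing as removing the directory by hand and rerunning "rundig". Note that this throws away the whole index, so it is a workaround rather than a fix.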

A grep of db/db.words.db shows the keyword from the now-deleted bar.html is still in the wordlist - could this be why htsearch still thinks one page matches the search criteria?

Is db.words.db NOT one of the files that get erased when "rundig" runs "htdig -i"?
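
The grep check above can be repeated with a short Python sketch like the one below. It is a rough equivalent of grep, not a parser for the database format: it assumes the keyword is stored in the file as plain ASCII bytes (which the grep match suggests), and the function name is invented for this example.

```python
def file_contains_word(path, word, chunk_size=65536):
    """Return True if the raw bytes of word occur anywhere in the file."""
    needle = word.encode("ascii")
    if not needle:
        raise ValueError("word must not be empty")
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if needle in window:
                return True
            # Keep the last len(needle) - 1 bytes so that a match
            # spanning two chunks is not missed.
            keep = len(needle) - 1
            tail = window[-keep:] if keep > 0 else b""
```

A True result only shows that the bytes are still present in the file. It does not prove that htsearch treats them as a live entry, so it narrows the question down rather than answering it.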




This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:28:51 PST