Re: [htdig] Searching for "All" versus "Any"]


Subject: Re: [htdig] Searching for "All" versus "Any"]
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Apr 05 2000 - 12:11:05 PDT


According to ccouple1@swarthmore.edu:
> On Wed, Apr 05, 2000 at 01:25:55PM -0500, Gilles Detillieux wrote:
> > "htdig -ivvvvc newconfig.conf" to see what htdig is doing when it parses
> > the title of this page. Take a look at the resulting db.wordlist as well,
> > to see if "littérature" (or some mangled form of it) is getting into the
> > database.
>
> okay, here goes....
>
> from log of htdig session:
...
> title: Littérature francophone virtuelle (ClicNet)
...
> so, it matched the title from the header. then:
...
> it seems to match the title inside of the <H2></H2> tags
>
> Littérature does appear in the wordlist database, as well (only it is non-cap'd):
>
> littérature i:0 l:6 w:105469 c:5
>
> Is any of this helpful, at all?

Yes, it shows that htdig is indexing the document correctly. The c:5
shows the word occurred 5 times in the document, and the weight is quite
large, as you'd expect for a word in the title.

> > Are your title_factor
> > and/or heading_factor_1 non-zero?
>
> I'm not sure what you mean by this last bit...

I was wondering if perhaps you had set the title_factor attribute, or one
of the heading_factor_* attributes to 0 in your htdig.conf. This would
give the word a weight of 0, so it wouldn't be found. However, the results
above show the weight is high, so this isn't the problem. Also, there are
3 other occurrences of littérature elsewhere in the document, so even if
title_factor and heading_factor_2 were 0 you'd still get a total weight
greater than 0 for the word.

Now, you need to find out why there's a discrepancy between the results
when you index just this one document vs. when you index your whole site.
This is where it may get sticky, as we seem to have ruled out most simple
possibilities. It does seem to indicate a bug somewhere in the software,
but where?

Possibilities:
1) htdig may be losing words when you index a whole site, perhaps due to
a memory leak of some sort
2) htmerge may be losing words when merging the words together
3) database corruption could be happening, causing the database itself
to lose words

You may want to look for littérature in the full db.wordlist for the whole
site, and try to find the record that corresponds to the problem document.
First, you'll need the document ID for this document, which you can get
from the log of the last run of htdig, if you still have it (and if you
ran with -v). If the word is in there for this document, it's likely a
database problem. If it's not, it's likely a bug in htdig or htmerge,
and you'd need to examine full htdig -vvv and htmerge -vvv logs to get
to the bottom of it. It could also be caused by a sort program that's not
8-bit clean, I suppose. It may also help to look at all littérature
records in db.wordlist, before and after running htmerge, to see if
they're being merged correctly.
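
If it helps, the wordlist checks described above could be scripted roughly
as follows. This is only a sketch, not part of ht://Dig. It assumes that
db.wordlist is a text file with one record per line in the form shown
earlier ("littérature i:0 l:6 w:105469 c:5"), that the i: field is the
document ID, and that the file is ISO-8859-1 encoded. The function names
and the file names in the usage note are hypothetical.

```python
def parse_record(line):
    """Split a line like 'word i:0 l:6 w:105469 c:5' into (word, fields)."""
    parts = line.split()
    if not parts:
        return None
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition(":")
        # Keep only key:integer pairs; anything else is ignored.
        if sep and value.lstrip("-").isdigit():
            fields[key] = int(value)
    return parts[0], fields


def find_records(path, word, doc_id=None, encoding="latin-1"):
    """Return the field dicts of every record for `word` in the wordlist.

    If doc_id is given, keep only records whose i: field equals it
    (assumption: i: holds the document ID).
    """
    matches = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            record = parse_record(line)
            if record is None:
                continue
            found_word, fields = record
            if found_word != word:
                continue
            if doc_id is not None and fields.get("i") != doc_id:
                continue
            matches.append(fields)
    return matches


def missing_after_merge(before_path, after_path, word):
    """List document IDs that have `word` before htmerge but not after."""
    before = {r["i"] for r in find_records(before_path, word) if "i" in r}
    after = {r["i"] for r in find_records(after_path, word) if "i" in r}
    return sorted(before - after)
```

For example, after saving a copy of db.wordlist before running htmerge,
missing_after_merge("wordlist.before", "wordlist.after", "littérature")
would list the document IDs whose records for the word disappeared in the
merge. An empty list would point away from htmerge and toward htdig or
the database itself.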

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



