Re: [htdig] Searching for "All" versus "Any"]


Subject: Re: [htdig] Searching for "All" versus "Any"]
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Apr 05 2000 - 11:25:55 PDT


According to ccouple1@swarthmore.edu:
> On Wed, Apr 05, 2000 at 12:43:44PM -0500, Gilles Detillieux wrote:
> > According to ccouple1@swarthmore.edu:
> > > 'Littérature' returns 54 results, none of which is the page
> > > entitled 'Littérature francophone virtuelle' BUT almost all of which
> > > contain the target string...
> >
> > A few possibilities to look into:
> >
> > 1) the page entitled 'Littérature francophone virtuelle' contains a slightly
> > different spelling of 'Littérature' than your search string. Check the
> > HTML source for the page carefully, to make sure there isn't some difference
> > in accents or spelling.
>
> Double-checked this. Search string and version in the page is the
> same. In fact, copied and pasted directly from my browser window into
> the search page.
> >
> > 2) the SGML entity for the 'é' in the title isn't being converted correctly.
> > There were problems with numeric entities in many 3.2 snapshots and the last
> > beta.
>
> hmmmm... the only problem with this line of thought is that if é
> isn't being properly converted, it wouldn't be converted across
> the entire website, so we'd never see the search string in htdig's
> results... Also, I'm running 3.1.5, not any of the 3.2 snapshots.

What I had in mind was the possibility that this title used a different
entity for the 'é' than the other documents do. That doesn't seem to be
the case.

> > 3) that page was indexed before you had the locale configured correctly,
> > and never reindexed, so the accented letter was lost. Try touching the
> > page's source file and reindexing it, or reindexing from scratch.
>
> Actually, I didn't index this particular site until after
> reconfiguring its locale. I reindexed the site (just to be on the
> safe side) using first htdig -i -c /path/to/config and then htmerge
> -c /path/to/config. The results of an "ALL" search for 'Littérature
> francophone virtuelle' remain the same - 54 results, without the target
> page entitled 'Littérature francophone virtuelle'.

OK, how about creating a different config file that sets start_url to
only the one page that's giving you problems, and perhaps change
database_dir to avoid clobbering your current database, and then running
"htdig -ivvvvc newconfig.conf" to see what htdig is doing when it parses
the title of this page. Take a look at the resulting db.wordlist as well,
to see if "littérature" (or some mangled form of it) is getting into the
database.
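
For the db.wordlist check, a rough sketch in Python (not part of ht://Dig)
may save some eyeballing. It assumes the wordlist is a plain-text file with
one entry per line, the word being the first tab- or space-separated field,
and that the file is in ISO-8859-1; verify both against your own file. The
function names and the 8-character allowance for whatever replaced the
accented letter are arbitrary choices made for illustration.

```python
import re


def build_pattern(target):
    """Build a regex that tolerates a mangled accented letter.

    Each non-ASCII character in the target may be missing or replaced
    by up to 8 arbitrary characters (assumed limit, e.g. a leftover
    entity name), so 'littérature' becomes 'litt.{0,8}rature'.
    """
    parts = []
    for ch in target.lower():
        if ch.isascii():
            parts.append(re.escape(ch))
        else:
            parts.append(".{0,8}")
    return re.compile("".join(parts))


def find_candidates(wordlist_path, target, encoding="latin-1"):
    """Return the distinct words in the wordlist that resemble target.

    Assumes the word is the first tab- or space-separated field on
    each line; adjust if your wordlist is laid out differently.
    """
    pattern = build_pattern(target)
    found = set()
    with open(wordlist_path, encoding=encoding) as handle:
        for line in handle:
            word = line.rstrip("\r\n").split("\t")[0].split(" ")[0]
            if word and pattern.fullmatch(word.lower()):
                found.add(word)
    return sorted(found)
```

Calling find_candidates("db.wordlist", "littérature") lists every distinct
form that made it into the wordlist. If only an unaccented or truncated form
shows up, the accent is being lost at indexing time; if nothing shows up,
the word may have been split at the accented letter.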

By the way, when you say the page is entitled 'Littérature francophone
virtuelle', do you mean the document's <head> section contains
"<title>Littérature francophone virtuelle</title>", or do you mean it's
the main heading (i.e. <h1>) in the document? Are your title_factor
and/or heading_factor_1 non-zero?

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



