Re: [htdig] Lost words


Subject: Re: [htdig] Lost words
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Jan 04 2001 - 08:58:39 PST


According to Tuomas Jormola:
> At my company we're trying to migrate from clumsy self-implemented
> search engine to htdig but it's not quite painless. The scenario is this:
> We've two databases on separate servers. One for public sites and one for
> intranet sites. The public database is only 4.2M and it has 7 sites indexed
> in it. The intra database is 690M with 4 sites/vhosts. Public search is
> accurate, fast and working great but intra search is causing troubles.
>
> For example, if you index a single site that contains lots of on-line
> manuals, the database is about 380M and word "aix" returns over 18000 hits.
> But when this site is indexed with the other intra sites, "aix" returns
> only 27 hits, most of them point to the on-line manual server as expected
> though. But where have thousands of the hits gone?
>
> So if these sites with gigabytes of content are indexed separately,
> the search is accurate but when the index is only one big db, a great
> amount of correct links is missed. Any guesses whether this is due to
> 1) feature in htdig/htmerge and if so, is there a way to disable it?
> 2) bug in htdig?
> 3) bug in Berkeley db?
> 4) bad configuration?

Hard to say for sure. As you're not using htmerge to build the one
big db from the separate, smaller dbs, that rules out problems in the
merging code causing this problem. Is the size of the big db roughly
equal to the sum of the sizes of the separate ones? It could be an
obscure htdig or htmerge bug, or an AIX-specific problem. This sure
isn't ringing any familiar bells, if that's what you're wondering.
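
For the size comparison, here is a minimal sketch in Python, assuming each
database lives in its own directory and that "size" means the total of all
files in that directory (as you describe it). The function names and the
command-line layout are made up for illustration; this is not part of htdig.

```python
import os
import sys


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compare_sizes(combined_dir, separate_dirs):
    """Compare one combined db directory against the sum of separate ones.

    Returns (combined_bytes, separate_bytes, ratio), where ratio is
    combined / separate, or NaN if the separate dbs are empty.
    """
    combined = dir_size(combined_dir)
    separate = sum(dir_size(d) for d in separate_dirs)
    ratio = combined / separate if separate else float("nan")
    return combined, separate, ratio


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: first argument is the combined db directory,
    # the remaining arguments are the single-site db directories.
    combined, separate, ratio = compare_sizes(sys.argv[1], sys.argv[2:])
    print("combined db: %d bytes" % combined)
    print("sum of separate dbs: %d bytes" % separate)
    print("ratio: %.2f" % ratio)
```

A ratio far below 1 would suggest that words or documents are being dropped
while the big db is built, rather than at search time.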

> We're using htdig-3.1.5 and Berkeley db that was included in htdig archive
> running on AIX 4.3. htdig was compiled using IBM VisualAge C++ Pro for
> AIX Version 5. And here's the list of configuration options that were
> changed against the default config (excluding options that are solely
> used to control the layout of htsearch):
> ----
> # to make searching of words with umlauts work
> locale: fi_FI
> # everything is valid :)
> valid_punctuation:
> # to be able to search weird chars used in example scripts etc.
> extra_word_characters: @.-_/!#$%^&'

OK, that's a pretty unusual use of the above two attributes. Are you
aware that with these settings, the following 3 words will be treated
as separate and distinct words, and a search for one of them will not
find the other two?

        aix-based aix aix.

However, I don't think that's the cause of the problem you're reporting,
if you're using the same settings for these attributes in all your
databases.
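
To illustrate the point about those two attributes, here is a rough sketch
in Python. It is a simplified model of the word extraction, not htdig's
actual code: it assumes that characters in valid_punctuation are dropped
from a word without ending it, and that characters in extra_word_characters
are kept as part of the word. The function name index_words is made up, and
the valid_punctuation value in the second call is only an assumed
comparison setting.

```python
def index_words(text, valid_punctuation="", extra_word_characters=""):
    """Split text into lowercased index words (simplified model)."""
    words = []
    current = []
    # The trailing space flushes the last word.
    for ch in text + " ":
        if ch.isalnum() or ch in extra_word_characters:
            # Kept as part of the word.
            current.append(ch.lower())
        elif ch in valid_punctuation:
            # Dropped from the word, but does not end it.
            continue
        else:
            # Anything else ends the current word.
            if current:
                words.append("".join(current))
                current = []
    return words


if __name__ == "__main__":
    sample = "aix-based aix aix."

    # Settings from the quoted config: nothing is stripped,
    # and the punctuation becomes part of the word.
    print(index_words(sample, "", "@.-_/!#$%^&'"))
    # prints ['aix-based', 'aix', 'aix.']

    # Assumed comparison: punctuation listed in valid_punctuation instead.
    print(index_words(sample, ".-_/!#$%^&'", ""))
    # prints ['aixbased', 'aix', 'aix']
```

With the first set of values, an exact search for "aix" matches only the
middle word, which is the behaviour described above.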

> # numbers, too, of course
> allow_numbers: true
> # exact matches only
> search_algorithm: exact:1
> ----
>
> BTW. Every test mentioned above was performed using a db built from
> scratch with htdig/htmerge. Also it isn't due to erroneous restrict
> or exclude values. When talking about the size of the db, I mean the total
> size of all files in db directory. No support for optional algorithms was
> built using htfuzzy. The same config file was used in every test
> (well, database_dir and start_url were included from site-specific
> config file if only indexing a single site). htdig/htmerge reported
> no errors while creating each db and there's plenty of disk space.
...
> Oh I forgot to mention that one reason for this could be that frigging AIX,
> right? But I don't want to test htdig on my Linux desktop machine before
> everything else is tried at the actual server side.

We sure haven't tested htdig very thoroughly on AIX, so I would be
inclined to suspect a system-specific problem is at work here. I think
testing your configurations on a Linux system would be a very good idea.
If the problem occurs there too, then it would point more surely to a
configuration problem or a bug. If the problem doesn't occur on Linux,
then it's almost certainly an AIX-specific thing. Either way, we'd need
more data to narrow it down.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



