[htdig] Lost words

Subject: [htdig] Lost words
From: Tuomas Jormola (tj@Elma.Net)
Date: Mon Dec 18 2000 - 08:00:48 PST


At my company we're trying to migrate from a clumsy self-implemented
search engine to htdig, but it's not quite painless. The scenario is this:
we have two databases on separate servers, one for public sites and one for
intranet sites. The public database is only 4.2M and has 7 sites indexed
in it. The intra database is 690M with 4 sites/vhosts. Public search is
accurate, fast and working great, but intra search is causing trouble.

For example, if you index a single site that contains lots of on-line
manuals, the database is about 380M and the word "aix" returns over 18000
hits. But when this site is indexed together with the other intra sites,
"aix" returns only 27 hits. Most of them point to the on-line manual
server, as expected, but where have the thousands of other hits gone?

So if these sites with gigabytes of content are indexed separately,
the search is accurate, but when everything is indexed into one big db, a
great number of correct links are missed. Any guesses whether this is due to
1) a feature in htdig/htmerge, and if so, is there a way to disable it?
2) a bug in htdig?
3) a bug in Berkeley db?
4) bad configuration?

We're using htdig-3.1.5 and the Berkeley db that was included in the htdig
archive, running on AIX 4.3. htdig was compiled using IBM VisualAge C++ Pro
for AIX Version 5. Here's the list of configuration options that were
changed from the default config (excluding options that are solely
used to control the layout of htsearch):

# to make searching of words with umlauts work
locale: fi_FI
# everything is valid :)
# to be able to search weird chars used in example scripts etc.
extra_word_characters: @.-_/!#$%^&'
# numbers, too, of course
allow_numbers: true
# exact matches only
search_algorithm: exact:1
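
To illustrate what the options above do to the word list, here is a minimal
sketch in Python. It is a simplified model and not htdig's actual tokenizer:
the function name tokenize, the sample sentence and the rule used for
allow_numbers are all assumptions made for the example. It only shows that
treating punctuation as word characters turns "aix." or "root@aix" into
words of their own, which an exact search for "aix" would then not match.

```python
def tokenize(text, extra_word_characters="", allow_numbers=False):
    """Split text into lowercase words (simplified model, not htdig's code).

    A word is a run of alphanumeric characters plus any character listed in
    extra_word_characters. Words without a single letter are dropped unless
    allow_numbers is true (an assumed approximation of that option).
    """
    words, current = [], []
    # The trailing space makes sure the last word gets flushed.
    for ch in text.lower() + " ":
        if ch.isalnum() or ch in extra_word_characters:
            current.append(ch)
        elif current:
            word = "".join(current)
            current = []
            if allow_numbers or any(c.isalpha() for c in word):
                words.append(word)
    return words


# Value copied from the config above; the sample text is made up.
EXTRA = "@.-_/!#$%^&'"
sample = "Install on AIX. See /usr/lpp/aix-4.3 or mail root@aix for AIX!"

default_words = tokenize(sample)
custom_words = tokenize(sample, EXTRA, allow_numbers=True)

print(default_words.count("aix"))  # exact matches with default word characters
print(custom_words.count("aix"))   # exact matches with the extra characters
print(custom_words)
```

Since the same config was used for every database, this effect alone would
not account for the difference between the single-site and the combined
index. It may still be worth keeping in mind when comparing hit counts for
exact matches.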

BTW, every test mentioned above was performed using a db built from
scratch with htdig/htmerge. The problem isn't due to erroneous restrict or
exclude values either. When talking about the size of the db, I mean the
total size of all files in the db directory. No support for optional
algorithms was built using htfuzzy. The same config file was used in every
test (when indexing only a single site, database_dir and start_url were
included from a site-specific config file). htdig/htmerge reported no
errors while creating each db, and there's plenty of disk space.

Thanks and Best Regards,
Tuomas Jormola <tj@elma.net>
Elma Electronic Trading - http://www.elma.net

