htdig: Hyphenation Problems and Maximum Word Length?


plucas@frost.com
Fri, 11 Dec 1998 14:27:57 -0800


Has anyone else had problems searching for hyphenated names or words?
I was searching for Hewlett-Packard and, even though there were
thousands of matches with the name in the first few bytes of the saved
excerpt, none of the excerpts were displayed. I have a hyphen in the
valid_punctuation string in the config file, so maybe the routine that
highlights the search words in the excerpt is trying to compare
HewlettPackard with Hewlett-Packard and not getting any matches?
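
To illustrate what I suspect is happening, here is a rough Python sketch.
It is not htdig's actual code: index_form and naive_highlight are names I
made up, and the assumption is only that the query has the
valid_punctuation characters stripped before it is compared against the
raw excerpt text.

```python
import re

VALID_PUNCTUATION = "-"  # assumed: the value from my config file


def index_form(word, valid_punctuation=VALID_PUNCTUATION):
    """Hypothetical model: drop valid_punctuation characters, then lowercase."""
    return "".join(ch for ch in word if ch not in valid_punctuation).lower()


def naive_highlight(excerpt, query):
    """Hypothetical highlighter: looks for the stripped query in the raw excerpt."""
    needle = index_form(query)
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", excerpt)


excerpt = "Hewlett-Packard announced a new printer."
print(index_form("Hewlett-Packard"))                           # hewlettpackard
print(naive_highlight(excerpt, "Hewlett-Packard") == excerpt)  # True: no match
```

If the highlighter really works like this, the stripped form never occurs
in the excerpt, which would explain why nothing gets highlighted.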

I tried removing the hyphen from valid_punctuation, but then it started
searching for Hewlett and Packard as separate words, which was not quite
what I wanted. I tried a few other terms, and the results were similar
for hi-fi, x-ray, etc.

Also, does anyone know if there is a maximum word length that htdig will
store? If there is (and on my system it seems to be set to 12), how would I
change it? I am using htdig 3.1.0b2 on a Solaris 2.6 SPARC box.

I created an HTML file containing only the name Hewlett-Packard and indexed
it.

With a hyphen "-" in the valid_punctuation string in the configuration file,
my db.wordlist contains:
-0
test c:1 l:241 i:1 w:75900 a:0
hewlettpacka c:1 l:632 i:1 w:368 a:0

I looked at a few other instances of db.wordlist on my system and found
lots of odd run-on words like this, but none were longer than 12 characters
(some were shorter).

Without the hyphen the db.wordlist looks like this:
-0
test c:1 l:241 i:1 w:75900 a:0
hewlett c:1 l:632 i:1 w:368 a:0
packard c:1 l:724 i:1 w:276 a:0
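
The two word lists above are consistent with a simple model, sketched here
in Python. This is only my guess at the behaviour, not htdig's real code:
index_words is a made-up function, and the limit of 12 is inferred from
what I see in db.wordlist rather than taken from any documentation.

```python
import re

MAX_WORD_LENGTH = 12  # assumed limit, inferred from my db.wordlist files


def index_words(text, valid_punctuation, max_len=MAX_WORD_LENGTH):
    """Hypothetical model of how words end up in db.wordlist."""
    # Characters in valid_punctuation are dropped, joining their neighbours.
    for ch in valid_punctuation:
        text = text.replace(ch, "")
    # Any other non-alphanumeric character acts as a word separator.
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Words are lowercased and cut off at the maximum length.
    return [w.lower()[:max_len] for w in words]


print(index_words("Hewlett-Packard", "-"))  # ['hewlettpacka']
print(index_words("Hewlett-Packard", ""))   # ['hewlett', 'packard']
```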

I am not sure if this is better, but this weekend I am going to try
re-indexing without any valid_punctuation and see what happens.

Has anyone else solved the hyphenation problem?

Thanks in advance,

Paul Lucas
Frost & Sullivan

