[htdig] Re: valid_punctuation setting (was: extra_word_characters (PR#952))

From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Nov 24 2000 - 08:26:00 PST

According to Tomas Frydrych:
> I do have one question though; when defining valid_punctuation, do
> I have to include ' ' (i.e. space), or is ' ' always included, and if I
> have to include it explicitly, where/how do I put it in the string?

No, white space characters (space, tab, newline) are treated separately
from valid_punctuation and from any other punctuation characters. The htdig
parser uses the C library function isspace() to test whether a character
is white space. The result depends on your locale, but with any ASCII or
ISO character set it will be pretty much the three standard characters
above, plus a few more obscure ones such as vertical tab, form feed and
carriage return. It would not make sense to add a space to
valid_punctuation, nor can you.
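
To make the white space point concrete, here is a minimal sketch in Python
(not htdig's own code, which is C++). It assumes the default "C" locale,
where isspace() is true for exactly six ASCII characters. The names
C_LOCALE_SPACE, is_space and split_on_space are made up for illustration.

```python
# Assumption: the default "C" locale, where isspace() is true for
# space, tab, newline, vertical tab, form feed and carriage return.
C_LOCALE_SPACE = " \t\n\v\f\r"


def is_space(ch):
    """Rough stand-in for C's isspace() on one ASCII character (hypothetical helper)."""
    return ch in C_LOCALE_SPACE


def split_on_space(text):
    """Break text at white space only; punctuation is left untouched here."""
    chunks = []
    current = []
    for ch in text:
        if is_space(ch):
            # White space always ends the current chunk, whatever
            # valid_punctuation happens to contain.
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        chunks.append("".join(current))
    return chunks
```

For instance, split_on_space("post-doctoral\tfellow") yields the two chunks
"post-doctoral" and "fellow": the tab separates them without being listed
in any punctuation setting.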

The valid_punctuation characters are those that are allowed within a
compound word. Historically, a word like "post-doctoral" was indexed
only as "postdoctoral" if the "-" was in valid_punctuation. In more
recent versions, it is indexed as "postdoctoral", "post" and "doctoral".
Either way, valid_punctuation characters have a special meaning within
a word: they don't cause a distinct break between words the way that any
other punctuation character, or white space, would. E.g. the comma ","
is not normally included in valid_punctuation, so it always breaks words
apart, while the hyphen or apostrophe can appear within a word (in
English, in any case).
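
The behaviour of the more recent versions can be sketched roughly as
follows. This is an illustration in Python under simplifying assumptions,
not htdig's actual parser: the function name index_words and the
valid_punctuation value "-'" are hypothetical, and details such as
minimum word length, case folding and extra_word_characters are ignored.

```python
def index_words(text, valid_punctuation="-'"):
    """Return the words a compound-aware indexer might store (sketch only)."""
    # Step 1: cut the text into chunks. Letters, digits and
    # valid_punctuation characters stay inside a chunk; white space and
    # any other punctuation (such as ",") end it.
    chunks = []
    current = []
    for ch in text:
        if ch.isalnum() or ch in valid_punctuation:
            current.append(ch)
        elif current:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))

    # Step 2: for each chunk, store the compound word with the
    # valid_punctuation removed, then each part on its own.
    words = []
    for chunk in chunks:
        parts = []
        piece = []
        for ch in chunk:
            if ch in valid_punctuation:
                if piece:
                    parts.append("".join(piece))
                    piece = []
            else:
                piece.append(ch)
        if piece:
            parts.append("".join(piece))
        if not parts:
            continue  # the chunk held nothing but punctuation
        words.append("".join(parts))
        if len(parts) > 1:
            words.extend(parts)
    return words
```

With "-" in valid_punctuation, index_words("post-doctoral work, done")
gives "postdoctoral", "post", "doctoral", "work" and "done". With an empty
valid_punctuation the hyphen breaks the word like any other punctuation,
so only "post", "doctoral", "work" and "done" come out.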

