Re: [htdig3-dev] Re: [htdig] New feature proposal

heddy Boubaker
03 May 1999 16:40:06 +0200

 "Geoff" == Geoff Hutchison writes:

Geoff> At 7:33 AM -0400 4/30/99, Torsten Neuer wrote:

Geoff> Actually several major search engines, including AltaVista, seem to do
Geoff> exactly this already. The feature has been requested a few times, though
Geoff> never quite as specifically.

 I'm happy to hear that ;-)
>> url_path_as_keywords: [true|false] # self-explaining
>> url_path_increment_factor: n # where n is of N
Geoff> You don't need url_path_as_keywords since setting the factor to 0 will
Geoff> effectively disable it.

 Not quite. What I had in mind at the beginning was really an increment (x+n),
 not a factor (x*n), so url_path_as_keywords is really needed, or else a test on
 url_path_increment_factor AND url_path_start_factor == 0 ??? (see below)
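
 A minimal sketch of why 0 only works as an off switch in the multiplicative
 case. The function names and the base weight x are hypothetical, chosen for
 illustration; this is not ht://Dig code.

```python
def weight_as_factor(x, n):
    # Multiplicative reading: n == 0 wipes out the weight entirely.
    return x * n


def weight_as_increment(x, n):
    # Additive reading: n == 0 leaves the base weight x untouched,
    # so the keyword would still be indexed with weight x.
    return x + n


base = 1  # assumed base weight, for illustration only
print(weight_as_factor(base, 0))
print(weight_as_increment(base, 0))
```

 With an increment, a separate on/off switch (or an explicit test for both
 values being 0) is therefore still required.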
Geoff> If we're happy to limit it to only indexing "words" based on the
Geoff> slashes in the path, it's not very hard. The URL class in ht://Dig
Geoff> already allows you to grab only the path, so then you split it based
Geoff> on '/' and add the words using the Retriever class.
Geoff> I always wonder if we should worry about URLs like:
Geoff> -> cafe ?
Geoff> -> foobar ?

 This sounds like humor, but I didn't catch it, sorry ;-) IMHO the answer to
 your very metaphysical thoughts is: yes, we should limit it to words based on
 slashes in the path!!

 So to summarize, if such a feature is implemented it'll need 3 more options:
url_path_as_keywords: [true|false] # default false
url_path_increment_factor: m # default 1
url_path_start_factor: n # default 1

So that urls like http://server/path-component1/pc2/pc3/filename.extension
will give keywords with weight:
path-component1 : n + (m*0)
pc2: n + (m*1)
pc3: n + (m*2)
filename: n + (m*3)
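
The weighting above could be sketched roughly as follows. This is only an
illustration in Python, not the ht://Dig implementation (which would use its
own URL and Retriever classes); the function name url_path_keywords and its
parameters start and increment are hypothetical stand-ins for
url_path_start_factor (n) and url_path_increment_factor (m). Dropping the
file extension from the last component is assumed from the example, where
"filename.extension" yields the keyword "filename".

```python
from urllib.parse import urlsplit


def url_path_keywords(url, start=1, increment=1):
    """Return (keyword, weight) pairs for each '/'-separated path component.

    The component at depth k (0-based) gets weight start + increment * k.
    """
    path = urlsplit(url).path
    # Split on '/' and ignore empty pieces (leading or doubled slashes).
    components = [c for c in path.split("/") if c]
    if components:
        last = components[-1]
        # Strip the extension from the final component; keep the name
        # unchanged if stripping would leave nothing (e.g. ".hidden").
        components[-1] = last.rsplit(".", 1)[0] or last
    return [(word, start + increment * depth)
            for depth, word in enumerate(components)]


if __name__ == "__main__":
    url = "http://server/path-component1/pc2/pc3/filename.extension"
    for word, weight in url_path_keywords(url, start=2, increment=3):
        print(word, weight)
```

With n = 2 and m = 3, the four keywords get the weights 2, 5, 8 and 11, so
words deeper in the path count for more.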



- heddy -
