heddy Boubaker (firstname.lastname@example.org)
03 May 1999 16:40:06 +0200
>>>>> "Geoff" == Geoff Hutchison <email@example.com> writes:
Geoff> At 7:33 AM -0400 4/30/99, Torsten Neuer wrote:
Geoff> Actually several major search engines, including AltaVista, seem to do
Geoff> exactly this already. The feature has been requested a few times, though
Geoff> never quite as specifically.
I'm happy to hear that ;-)
>> url_path_as_keywords: [true|false] # self-explaining
>> url_path_increment_factor: n # where n is of N
Geoff> You don't need url_path_as_keywords since setting the factor to 0 will
Geoff> effectively disable it.
Not quite: what I had in mind from the beginning was really an increment (x+n),
not a factor (x*n), so url_path_as_keywords is still needed, or else a test on
url_path_increment_factor AND url_path_start_factor == 0 (see below).
Geoff> If we're happy to limit it to only indexing "words" based on the
Geoff> slashes in the path, it's not very hard. The URL class in ht://Dig
Geoff> already allows you to grab only the path, so then you split it based
Geoff> on '/' and add the words using the Retriever class.
Geoff> I always wonder if we should worry about URLs like:
Geoff> http://wso.williams.edu/cafewso/ -> cafe ?
Geoff> http://www.foo.com/foo/bar/ -> foobar ?
This sounds like humor, but I didn't catch it, sorry ;-) IMHO the answer to
your very metaphysical thoughts is: yes, we should limit it to words based on
slashes in the path!!
So, to summarize: if such a feature is implemented, it'll need 3 more options:
url_path_as_keywords: [true|false] # default false
url_path_increment_factor: m # default 1
url_path_start_factor: n # default 1
So that URLs like http://server/path-component1/pc2/pc3/filename.extension
will give keywords with these weights:
path-component1 : n + (m*0)
pc2: n + (m*1)
pc3: n + (m*2)
filename: n + (m*3)
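
To make the arithmetic concrete, here is a rough sketch of the scheme in Python (not ht://Dig code). The function name path_keyword_weights and its parameters start and increment are made up for illustration; they stand in for url_path_start_factor (n) and url_path_increment_factor (m). Stripping the extension from the last component is my reading of the example above, not something ht://Dig does today.

```python
from urllib.parse import urlsplit


def path_keyword_weights(url, start=1, increment=1):
    """Return (keyword, weight) pairs for the path components of url.

    Hypothetical helper: start plays the role of url_path_start_factor (n),
    increment the role of url_path_increment_factor (m).
    """
    # Split the path on '/' and drop empty pieces (leading/trailing slash).
    components = [c for c in urlsplit(url).path.split("/") if c]

    # Assumption: the last component is a filename, so drop its extension.
    # A dotted directory name at the end of the path would be trimmed too.
    if components and "." in components[-1]:
        stem = components[-1].rsplit(".", 1)[0]
        if stem:  # leave dotfiles such as ".htaccess" untouched
            components[-1] = stem

    # Component i (counting from 0) gets weight n + (m * i).
    return [(c, start + increment * i) for i, c in enumerate(components)]


print(path_keyword_weights(
    "http://server/path-component1/pc2/pc3/filename.extension"))
# [('path-component1', 1), ('pc2', 2), ('pc3', 3), ('filename', 4)]
```

With the defaults (n = 1, m = 1) the deepest component gets the highest weight; setting both values to 0 gives every component a weight of 0, which is the "disabled" test mentioned earlier.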
- heddy -