Torsten Neuer (email@example.com)
Mon, 21 Jun 1999 19:17:32 +0200
According to Geoff Hutchison:
>Daniel Naber wrote:
>> can you say how difficult it is to add this feature? If you point me to
>> the files to change, and if it's not too difficult, I could try to add this.
>I did send a response, and it's not too difficult. But see below.
>> An example of what I mean: Someone searches for "foobar" and gets
>> www.blah.com/~blubb/foobarblah.html as a result, even if that file
>> doesn't contain the string "foobar".
>Now the initial request was more along these lines (which is easier):
>The request was to match "foo" or "bar" or "blah." For your example,
>you'd have to decide if "~" is to be stripped out (I'd say yes) and
>whether you'll just go with prefix matching to get "foobar" from
>If someone submits a function that splits a URL into words, I'll finish
>it. It's a matter of a time tradeoff--I'd rather work on things other
>than that function, and it's probably faster for me to put it in the
>correct place (in Retriever.cc).
To add a few quick thoughts on that URL splitting function:
- I'll assume the protocol identifier and the server name to be
  stripped out
- I'll assume the file extension of the document to be stripped out
This could easily be achieved for trivial URLs with the upcoming
regexp support ;-)
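
To illustrate the two assumptions above, here is a minimal sketch in Python.
It is not ht://Dig code (which is C++) and not what would go into
Retriever.cc. The name split_url_words and the choice of separators are
assumptions made for this example only. It expects an absolute URL with a
protocol identifier, drops the protocol and server name, strips "~" and the
file extension, and splits the rest into words.

```python
import re
from urllib.parse import urlsplit


def split_url_words(url):
    """Split the path of an absolute URL into lowercase words.

    Hypothetical helper for illustration; not part of ht://Dig.
    """
    # urlsplit separates the protocol and server name from the path,
    # so both are dropped simply by using .path alone.
    path = urlsplit(url).path
    # Strip one trailing file extension such as ".html".
    path = re.sub(r"\.[A-Za-z0-9]+$", "", path)
    # Treat every run of non-alphanumeric characters ("/", "~", "-",
    # "_", ".") as a separator; this also removes the "~".
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", path) if w]
```

With this sketch, "http://www.blah.com/~blubb/foobarblah.html" splits into
"blubb" and "foobarblah", so a search for "foobar" would still need prefix
matching to find it.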
However.. let's think of some more complex URLs:
^^^^ ^^^ ^^^ ^^^ !!!! !! ^^^ ^^^^ ?? ??? ??? !!!! ^^^^
(^ = stripped out / ! = included / ? = included, but confusing)
If we want to include HTTP GET parameters in this function,
we could run into trouble. But without the parameters, the search
method might not be useful for sites with dynamic content.
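
To make that trade-off concrete, here is a minimal sketch in Python of
pulling words out of GET parameters. It is an illustration only, not ht://Dig
code. The name query_words, the include_names switch, and the example.com
URL used below are all assumptions made for this example.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def query_words(url, include_names=False):
    """Return lowercase words found in the GET parameters of a URL.

    Hypothetical helper for illustration; not part of ht://Dig.
    """
    words = []
    # parse_qsl decodes "a=1&b=two+words" into
    # [("a", "1"), ("b", "two words")].
    for name, value in parse_qsl(urlsplit(url).query):
        # Parameter names are only indexed when explicitly requested.
        parts = [name, value] if include_names else [value]
        for part in parts:
            words.extend(
                w.lower() for w in re.split(r"[^A-Za-z0-9]+", part) if w
            )
    return words
```

For "http://example.com/cgi-bin/show.cgi?id=42&topic=url+splitting" this
returns "42", "url" and "splitting". The last two are useful search words,
while "42" is the kind of included-but-confusing word marked with "?" above.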
--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14    Tel: +49-4101-403605
D-25474 Ellerbek    Fax: +49-4101-403606
E-Mail: firstname.lastname@example.org    Internet: http://www.inwise.de