Re: [htdig] match part of URL?


Torsten Neuer (tneuer@inwise.de)
Mon, 21 Jun 1999 19:17:32 +0200


According to Geoff Hutchison:
>Daniel Naber wrote:
>> can you say how difficult it is to add this feature? If you point me to
>> the files to change, and if it's not too difficult, I could try to add
>> this.
>
>I did send a response, and it's not too difficult. But see below.
>
>> An example of what I mean: Someone searches for "foobar" and gets
>> www.blah.com/~blubb/foobarblah.html as a result, even if that file
>> doesn't contain the string "foobar".
>
>Now the initial request was more along these lines (which is easier):
>
>http://www.foo.com/bar/blah.html
>
>The request was to match "foo" or "bar" or "blah". For your example,
>you'd have to decide if "~" is to be stripped out (I'd say yes) and
>whether you'll just go with prefix matching to get "foobar" from
>"foobarblah".
>
>If someone submits a function that splits a URL into words, I'll finish
>it. It's a matter of a time tradeoff--I'd rather work on things other
>than that function and it's probably faster for me to put it in the
>correct place (in Retriever.cc).

To add a few quick thoughts on that URL splitting function:
- I'll assume the protocol identifier and the server name to be
  stripped out.
- I'll assume the file extension of the document to be stripped
  out.

This could easily be achieved for trivial URLs with the upcoming
regexp support ;-)
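
For the trivial case, the two assumptions above (plus Geoff's
suggestion to drop "~") could be sketched roughly like this. This is
Python rather than ht://Dig code, and split_url_words is a hypothetical
name used only for illustration:

```python
import re

# Hypothetical helper, not part of ht://Dig.
_PROTO_HOST = re.compile(r'^[a-z][a-z0-9+.-]*://[^/]*', re.IGNORECASE)
_EXTENSION = re.compile(r'\.[A-Za-z0-9]+$')
_SEPARATORS = re.compile(r'[^A-Za-z0-9]+')


def split_url_words(url):
    """Return the words found in the path of a trivial URL (no query string)."""
    path = _PROTO_HOST.sub('', url)   # strip protocol identifier and server name
    path = _EXTENSION.sub('', path)   # strip the document's file extension
    # Anything that is not a letter or digit ("/", "~", ".", "-") separates words.
    return [word for word in _SEPARATORS.split(path) if word]


print(split_url_words('http://www.foo.com/bar/blah.html'))
# prints ['bar', 'blah']
print(split_url_words('http://www.blah.com/~blubb/foobarblah.html'))
# prints ['blubb', 'foobarblah']
```

Note that "foobarblah" comes out as a single word, so a search for
"foobar" would still need prefix matching on top of this.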

However, let's think of some more complex URLs:
- http://www.foo.com/oops.up/?bar=http://no.way.org/oops.html
  ^^^^   ^^^ ^^^ ^^^ !!!! !!  ^^^ ^^^^   ?? ??? ??? !!!! ^^^^

(^ = stripped out / ! = included / ? = included, but confusing)

If we want HTTP GET parameters included in this function, we could
run into trouble. But without the parameters, the search method
might not be useful for sites with dynamic content.
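
To make the tradeoff concrete, here is a rough Python sketch (again not
ht://Dig code) that reproduces the marking of the example URL above.
The names split_url_with_query and include_query are hypothetical, and
the choice to drop parameter names while keeping their values is an
assumption read off the marker line:

```python
import re
from urllib.parse import urlsplit, parse_qsl

_EXTENSION = re.compile(r'\.[A-Za-z0-9]+$')
_SEPARATORS = re.compile(r'[^A-Za-z0-9]+')


def _words(text):
    """Split text into words at every run of non-alphanumeric characters."""
    return [word for word in _SEPARATORS.split(text) if word]


def _value_words(value):
    """Words of one parameter value; an embedded URL is taken apart first."""
    inner = urlsplit(value)
    if inner.scheme and inner.netloc:
        # Embedded URL: protocol and extension are dropped, but the host
        # words stay in (these are the "included, but confusing" ones).
        return _words(inner.netloc) + _words(_EXTENSION.sub('', inner.path))
    return _words(value)


def split_url_with_query(url, include_query=False):
    """Hypothetical helper: path words, optionally followed by query-value words."""
    parts = urlsplit(url)
    words = _words(_EXTENSION.sub('', parts.path))
    if include_query:
        # Parameter names are dropped; only their values contribute words.
        for _name, value in parse_qsl(parts.query, keep_blank_values=True):
            words.extend(_value_words(value))
    return words


url = 'http://www.foo.com/oops.up/?bar=http://no.way.org/oops.html'
print(split_url_with_query(url))
# prints ['oops', 'up']
print(split_url_with_query(url, include_query=True))
# prints ['oops', 'up', 'no', 'way', 'org', 'oops']
```

With the parameters included, "no", "way" and "org" end up as search
words for a page that has nothing to do with that server, which is
exactly the confusing part.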

So what now?

-Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: info@inwise.de            Internet: http://www.inwise.de