Re: [htdig] patch to parse URLs?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 11 Aug 1999 14:18:34 -0500 (CDT)


According to Leonard J. Hunt:
> We are now using htdig to index our discussion group, which
> puts user IDs into URLs like this:
> http://www.learn2.com/cgi-bin/learnline?23@^3290@14%40
> I am looking for a patch for htdig to take the user id
> (^3290 in this case) out of the URL before it gets indexed
> as a unique url. I set the server_max_docs so that htdig
> will finish indexing at some point as a temporary measure,
> but there's no guarantee that everything has been indexed
> and there are "unique" URLs that point to the same place.

You could probably fit something pretty easily into the
URL::normalizePath() function in htlib/URL.cc. It would end up looking
a bit like the code to strip out redundant "/./" stuff from the URL.
The pattern matching would be a bit more complicated, because you
certainly want to avoid stripping out stuff that doesn't apply, e.g. from
other CGI URLs, but the code for removing a section from the string
should be the same. You don't need to bother with the pathend stuff,
because in this case you are stripping stuff after the "?".

That's the simple solution, specific to your particular problem.
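
As a rough illustration of that simple case, here is a standalone sketch
in C++. It uses std::string rather than ht://Dig's own string class, and
the function name stripLearnlineUserId is made up for this example. It is
not code from htlib/URL.cc, only the kind of check-then-erase logic that
could be adapted there, and it assumes the user ID is always "^" followed
by digits, sitting between two "@" characters.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of ht://Dig: removes the "^<digits>" user ID
// from URLs of the form .../cgi-bin/learnline?<digits>@^<digits>@...
// Any URL that does not match this exact shape is returned unchanged.
std::string stripLearnlineUserId(const std::string &url)
{
    static const std::string marker = "/cgi-bin/learnline?";

    std::string::size_type pos = url.find(marker);
    if (pos == std::string::npos)
        return url;                       // some other URL; leave it alone

    // Skip the leading number that follows the '?'.
    std::string::size_type i = pos + marker.size();
    while (i < url.size() && std::isdigit(static_cast<unsigned char>(url[i])))
        ++i;

    // The user ID must be introduced by "@^".
    if (i + 1 >= url.size() || url[i] != '@' || url[i + 1] != '^')
        return url;

    std::string::size_type idStart = i + 1;   // position of '^'
    std::string::size_type j = idStart + 1;
    while (j < url.size() && std::isdigit(static_cast<unsigned char>(url[j])))
        ++j;

    // The user ID must be terminated by another '@'.
    if (j >= url.size() || url[j] != '@')
        return url;

    std::string result = url;
    result.erase(idStart, j - idStart);   // drop "^<digits>", keep both '@'
    return result;
}
```

With the URL from the question, this turns "...learnline?23@^3290@14%40"
into "...learnline?23@@14%40", and leaves other CGI URLs untouched.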

However, requests like this come up every now and then. It would almost
be worthwhile adding a more general, regex-based mechanism for stripping
off parts of URLs when normalizing. I'd see the need for something that
uses triplets. E.g.:

    strip_url_parts: a b c d e f

would mean for any URL that matches pattern a, replace pattern b with
string c, and likewise for subsequent triplets. Or in sed notation,

        /a/s/b/c/

The problem above would then require something like this:

    strip_url_parts: /cgi-bin/learnline?[0-9]*@^[0-9]*@ @^[0-9]*@ @@
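
To make the triplet idea a bit more concrete, here is a minimal sketch of
how such rules might be parsed and applied. Everything in it is an
assumption for illustration: strip_url_parts is only the attribute
proposed above, not an existing one, and UrlRewriteRule,
parseRewriteRules and applyRewriteRules are hypothetical names. It uses
std::regex with its default ECMAScript syntax instead of sed-style
patterns, so a literal "?" or "^" has to be written as "\?" or "\^".

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// One "a b c" triplet (all names here are hypothetical).
struct UrlRewriteRule {
    std::regex guard;         // pattern a: the URL must match this somewhere
    std::regex target;        // pattern b: the part of the URL to replace
    std::string replacement;  // string c: what to put in its place
};

// Split a whitespace-separated attribute value into triplets.
// Trailing tokens that do not complete a triplet are ignored.
std::vector<UrlRewriteRule> parseRewriteRules(const std::string &value)
{
    std::vector<UrlRewriteRule> rules;
    std::istringstream in(value);
    std::string a, b, c;
    while (in >> a >> b >> c)
        rules.push_back({std::regex(a), std::regex(b), c});
    return rules;
}

// Apply each rule in order, like sed's /a/s/b/c/ (first match only).
std::string applyRewriteRules(const std::vector<UrlRewriteRule> &rules,
                              std::string url)
{
    for (const UrlRewriteRule &rule : rules) {
        if (std::regex_search(url, rule.guard))
            url = std::regex_replace(url, rule.target, rule.replacement,
                                     std::regex_constants::format_first_only);
    }
    return url;
}
```

In this syntax the rule for the learnline case would be written as
"/cgi-bin/learnline\?[0-9]*@\^[0-9]*@ @\^[0-9]*@ @@". Since the value is
split on whitespace, patterns in this sketch cannot contain spaces.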

A simpler approach, but with more overhead, would be an option for
passing all URLs to an external command for filtering and/or editing.

Either way, implementing this would take a fair bit more work than
the simple case specific to your needs, but it would benefit many.
If there are any takers who'd like to implement this, please speak up.
(The discussion should probably move to htdig3-dev@htdig.org for follow-up
on this suggestion.) And no, I'm not volunteering! :-)

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2.0b3 on Wed Aug 11 1999 - 12:19:21 PDT