Re: [htdig3-dev] Use of & as CGI variable separator vs. HTML 4.0


Subject: Re: [htdig3-dev] Use of & as CGI variable separator vs. HTML 4.0
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Feb 15 2000 - 11:06:23 PST


Hi, folks. After thinking about some recent messages expressing
concern about htsearch's compliance with HTML 4.0, and its potential
vulnerability to cross-site scripting, I've decided there are a couple
things worth doing.

One thing is to add some minimal SGML entity encoding capability to
htsearch in the 3.1.x branch, and to support $&(var) and $%(var), as
3.2 does. I have working code to do this, and I'll probably commit this
to the tree for an eventual 3.1.5 release. (No, I wasn't planning to
do this, but it seems there now may be a need for it.)

Another thing which I finally decided to do was to blow the dust off of
some e-mail discussions from last April, such as this one...

According to me:
> According to Budd, S.:
> > > From: Gilles Detillieux [SMTP:grdetil@scrc.umanitoba.ca]
> > > According to Torsten Neuer:
> > > > According to Gilles Detillieux:
> > > > >They also recommend using ";" as an alternate separator, so we should
> > > > >also handle
> > > > >
> > > > >?x=1;y=2
> > > > >and use this in Display::createURL(), if we're to follow their
> > > > >recommendations. Does anyone see a problem with using a semicolon in
> > > > >this manner?
> > > >
> > > > I disagree with that since AFAIK a semicolon does not necessarily need
> > > > to be url-encoded.
> > > > Could cause trouble if a semicolon is part of a parameter.
> > >
> > till now we could not use & as part of a parameter?
>
> I believe you're right! I was assuming that parameters were getting
> URL-encoded individually, and any "&" within a parameter got encoded
> to "%26" before it got appended to the query string. That's not
> how Display::createURL() currently functions. It concatenates all
> parameter strings, separated by "&", then encodes the whole string without
> changing the ampersands. So, any "&" within a parameter would cause that
> parameter to be truncated at that point when it's re-read as input in
> a subsequent search. The CGI input handling code does it correctly.
> It separates out the individual variables from the query string, and
> decodes only the value of each variable.
>
> Display::createURL() really ought to do just the opposite - it should
> encode the value of each variable separately, ensuring that any special
> characters, including "=" & "&" (and ";" if we use it later), get encoded.
> Then it should assemble the query string from these encoded variables,
> using whatever separator we choose to standardize on. The encodeURL()
> function allows you to give it a list of valid punctuation as a 2nd
> parameter, so it should be easy to change this. We can then change
> htlib/cgi.cc to handle that separator as well as "&". The vertical bar
> "|" gets encoded right now, so if we do the encoding properly, theres no
> reason we couldn't allow it as a variable separator too. We can also
> allow "&" or "&" as separators, if we translate SGML entities
> before splitting the variables.

Following that discussion, there were some references to RFCs 1738 &
2396, and recommendations about how URL parameters should be encoded.
At the time, I shelved this because I didn't have the time to read the
RFCs and figure out the distinction between reserved and unreserved
punctuation, and how htsearch should handle these. Geoff had quickly
patched encodeURL() to use the reserved set of characters as the default
valid list, and we left things at that, knowing more work needed to be
done on this.

I took some time this morning to sort it out a bit more. As I had
suggested earlier, it seems the proper way to handle URL parameters
is to encode them before you assemble the query string, and only allow
_unreserved_ punctuation characters through unencoded. This will prevent
conflicts between reserved punctuation in the individual parameters,
and the reserved punctuation used to piece together the query string.
I implemented this under 3.1.4, as the patch below. I think this is
a reasonable approach that does what I had discussed. It now allows
URL parameters to be separated by either "&" or ";", as well as "&"
(in fact "&<anything>;" will work - I hope that's not a problem), but
in the URLs that it generates, it will use only ";" instead of "&",
for HTML 4.0 compliance.

I'd appreciate some extra eyeballs to peer over this code. It compiles
quietly on my Red Hat 4.2 Intel Linux system (with gcc-c++-2.7.2.1-2), and
it seems to work correctly as far as I can tell. Please let me know if
I've done any "no-nos" in this, especially with the encodeInput() macro.
If not, I'll adapt it for 3.2 as well. Maybe the unreserved list should
go right in URL.h, as the default for encodeURL(). Any preferences
on this? Of course, with the parsing of the $%(var) template variable
extension, it should also pass this new unreserved list to the encodeURL()
call in outputVariable(), for consistency. I don't know how people
had intended to use this extension, but it seems more likely to me
that it would be used to encode individual parameter values rather
than whole query strings, so any reserved character should be encoded.
I think we'll also want to change all the default follow-up search forms
(nomatch.html, syntax.html, header.html, wrapper.html) to use $&(WORDS),
$&(EXCLUDE) and %&(RESTRICT) as the default values for their corresponding
input tags. Have I missed any?

Feedback is welcome.

--- htlib/cgi.cc.oldparm Thu Dec 9 18:28:47 1999
+++ htlib/cgi.cc Tue Feb 15 10:46:51 2000
@@ -92,7 +92,7 @@ cgi::init(char *s)
         //
         // Now we need to split the line up into name/value pairs
         //
- StringList list(results, '&');
+ StringList list(results, "&;");
         
         //
         // Each name/value pair now needs to be added to the dictionary
--- htsearch/Display.cc.oldparm Tue Feb 15 09:16:54 2000
+++ htsearch/Display.cc Tue Feb 15 11:46:07 2000
@@ -24,6 +24,20 @@ static char RCSid[] = "$Id: Display.cc,v
 #include "HtURLCodec.h"
 #include "HtWordType.h"
 
+
+// Unreserved punctuation allowed unencoded in URLs. We use a more restricted
+// list of unreserved characters than allowed by RFC 2396 (which revises and
+// replaces RFC 1738), because it can't hurt to encode any of these
+// characters, and they can pose problems in some contexts. RFC 2396 says
+// that only alphanumerics, the unreserved characters "-_.!~*'(),", and
+// reserved characters used for their reserved purposes may be used
+// unencoded within a URL. We encode reserved characters because we now
+// encode URL parameter values individually before piecing together the whole
+// query string using reserved characters.
+
+static char unreserved[] = "-_.!~*";
+
+
 //*****************************************************************************
 //
 Display::Display(char *indexFile, char *docFile)
@@ -560,6 +574,7 @@ Display::createURL(String &url, int page
 {
     String s;
     int i;
+#define encodeInput(name) (s = input->get(name), encodeURL(s, unreserved), s.get())
 
     if (strlen(config["script_name"]) != 0) {
       url << config["script_name"];
@@ -570,34 +585,35 @@ Display::createURL(String &url, int page
     url << '?';
 
     if (input->exists("restrict"))
- s << "restrict=" << input->get("restrict") << '&';
+ url << "restrict=" << encodeInput("restrict") << ';';
     if (input->exists("exclude"))
- s << "exclude=" << input->get("exclude") << '&';
+ url << "exclude=" << encodeInput("exclude") << ';';
     if (input->exists("config"))
- s << "config=" << input->get("config") << '&';
+ url << "config=" << encodeInput("config") << ';';
     if (input->exists("method"))
- s << "method=" << input->get("method") << '&';
+ url << "method=" << encodeInput("method") << ';';
     if (input->exists("format"))
- s << "format=" << input->get("format") << '&';
+ url << "format=" << encodeInput("format") << ';';
     if (input->exists("sort"))
- s << "sort=" << input->get("sort") << '&';
+ url << "sort=" << encodeInput("sort") << ';';
     if (input->exists("matchesperpage"))
- s << "matchesperpage=" << input->get("matchesperpage") << '&';
+ url << "matchesperpage=" << encodeInput("matchesperpage") << ';';
     if (input->exists("keywords"))
- s << "keywords=" << input->get("keywords") << '&';
+ url << "keywords=" << encodeInput("keywords") << ';';
     if (input->exists("words"))
- s << "words=" << input->get("words") << '&';
+ url << "words=" << encodeInput("words") << ';';
     StringList form_vars(config["allow_in_form"], " \t\r\n");
     for (i= 0; i < form_vars.Count(); i++)
     {
       if (input->exists(form_vars[i]))
       {
- s << form_vars[i] << '=' << input->get(form_vars[i]) << '&';
+ s = form_vars[i];
+ encodeURL(s, unreserved); // shouldn't be needed, but just in case
+ url << s << '=';
+ url << encodeInput(form_vars[i]) << ';';
       }
     }
- s << "page=" << pageNumber;
- encodeURL(s);
- url << s;
+ url << "page=" << pageNumber;
 }
 
 //*****************************************************************************

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------ To unsubscribe from the htdig3-dev mailing list, send a message to htdig3-dev-unsubscribe@htdig.org You will receive a message to confirm this.



This archive was generated by hypermail 2b28 : Tue Feb 15 2000 - 11:09:20 PST