[htdig] PR #657: Misinterpretation of URL parameters


Torsten Neuer (tneuer@inwise.de)
Sun, 26 Sep 1999 18:56:48 +0200


The bug results from the behaviour of transSGML() in HTML.cc which
is not really suitable for use with URLs.

1) transSGML will not correctly translate "&amp;" into "&" (as
   required by the HTML 4.0 draft standard) if the configuration
   directive "translate_amp" is not set, i.e. the URL parameter
   "?i=1&amp;p=1" will not be translated into "?i=1&p=1".

2) transSGML will corrupt any URL that uses the traditional URL
   parameter delimiter (a bare "&"), because it attempts to
   translate a non-existent entity, which results in a space
   character, e.g. the URL parameter "?i=1&p=1" will be truncated
   to "?i=1".
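
To make case 2 concrete, here is a rough model of the failure. It is
not the ht://Dig code: oldTranslateModel() and cutAtSpace() are
hypothetical names, the model treats every entity as unrecognized to
stay short, and it assumes that the entity name is the run of
alphanumerics after the "&" and that the URL is later cut at the
first space.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical model of the pre-patch behaviour: every "&name" is treated
// as an unrecognized entity, its name is consumed, and a space is emitted.
std::string oldTranslateModel(const std::string &in)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '&') {
            out += in[i++];
            continue;
        }
        ++i;  // skip the ampersand
        // Assumption: the entity name is the following run of alphanumerics.
        while (i < in.size() && std::isalnum(static_cast<unsigned char>(in[i])))
            ++i;
        if (i < in.size() && in[i] == ';')
            ++i;  // a final ';' is used up
        out += ' ';  // unrecognized entity becomes a space
    }
    return out;
}

// Assumption: whatever consumes the URL stops at the first space.
std::string cutAtSpace(const std::string &url)
{
    return url.substr(0, url.find(' '));
}
```

Under these assumptions, "?i=1&p=1" becomes "?i=1 =1" after
translation and "?i=1" once it is cut at the space, which matches the
truncation described above.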

The following is a quick fix for this problem. It affects the
behaviour of the following functions (and those which use them):

- SGMLEntities::translate()
  Will return an ampersand instead of a space for unrecognized
  entities, thus leaving single ampersand characters "as is" (which
  will affect document text as well!).

- SGMLEntities::translateAndUpdate()
  Will restore the text pointer to the character after the ampersand
  for unrecognized entities.

- HTML::transSGML()
  Will translate any "&amp;" entity regardless of the settings of
  "translate_amp".

As stated above, this is only a quick fix, which might not work for
all cases (but it works for me so far). ,-)
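
The intended behaviour after the fix can be sketched as a
self-contained function. This is only an illustration, not the
patched ht://Dig code: decodeEntities() and the small kEntities table
are hypothetical stand-ins for the real entity handling in
SGMLEntities.cc.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical, deliberately tiny entity table for illustration only.
static const std::map<std::string, char> kEntities = {
    {"amp", '&'}, {"lt", '<'}, {"gt", '>'}, {"quot", '"'}};

// Decode entities in `in`. A recognized entity is replaced by its character.
// An unrecognized one yields a literal '&' and scanning resumes at the
// character right after the ampersand, so the following text is preserved.
std::string decodeEntities(const std::string &in)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '&') {
            out += in[i++];
            continue;
        }
        // Collect the candidate entity name after the ampersand.
        std::size_t j = i + 1;
        while (j < in.size() && std::isalnum(static_cast<unsigned char>(in[j])))
            ++j;
        auto it = kEntities.find(in.substr(i + 1, j - i - 1));
        if (it != kEntities.end()) {
            out += it->second;
            // A final ';' is used up if present.
            i = (j < in.size() && in[j] == ';') ? j + 1 : j;
        } else {
            out += '&';  // keep the ampersand "as is"
            i = i + 1;   // continue with the character after it
        }
    }
    return out;
}
```

With this sketch, both "?i=1&amp;p=1" and "?i=1&p=1" come out as
"?i=1&p=1". One case it does not handle: a parameter whose name
happens to match an entity name (for example "&lt=2") would still be
decoded as an entity.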

cheers,
  Torsten

*** HTML.cc~ Sun Sep 26 18:05:07 1999
--- HTML.cc Sun Sep 26 18:43:53 1999
***************
*** 1113,1122 ****
      convert = 0;
      while (*text)
      {
! if (*text == '&')
! convert << SGMLEntities::translateAndUpdate(text);
! else
! convert << *text++;
      }
      return convert.get();
  }
--- 1113,1127 ----
      convert = 0;
      while (*text)
      {
! if (*text == '&')
! {
! convert << SGMLEntities::translateAndUpdate(text);
! if( !strncmp(text,"amp;",4) )
! text += 4;
! }
! else
! convert << *text++;
      }
      return convert.get();
  }

*** SGMLEntities.cc~ Sun Sep 26 18:30:53 1999
--- SGMLEntities.cc Sun Sep 26 18:39:28 1999
***************
*** 165,171 ****
      }
      else
      {
! return ' '; // Unrecognized entity. Change it into a space...
      }
  }
  
--- 165,171 ----
      }
      else
      {
! return '&'; // Unrecognized entity. Return just an ampersand...
      }
  }
  
***************
*** 280,284 ****
      
      if (*entityStart == ';')
        entityStart++; // A final ';' is used up.
! return translate(entity);
  }
--- 280,287 ----
      
      if (*entityStart == ';')
        entityStart++; // A final ';' is used up.
! unsigned char e = translate(entity);
! if( e == '&' && !translate_amp )
! entityStart = orig + 1; // Catch unrecognized entities...
! return e;
  }

-- 
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: info@inwise.de            Internet: http://www.inwise.de



