Re: submission of patches for htdig


Tim Frost (Tim.Frost@nz.eds.com)
Thu, 29 Jan 1998 09:10:15 +0000


Andrew,

The patch is attached, together with an explanation of same.

It was prompted by the fact that a large percentage of references that
htdig was finding in our intranet were being rejected because the
parsing was wrong.

Earlier this month, I saw discussion on the list of a problem related to
the # and ? being stripped or not, and suspected that the issue was
related to quotes. I had been looking at HTML.cc, while the discussion
in that thread centred around URL.cc.

Tim

On Jan 28, 9:28, Andrew Scherpbier wrote:
> Subject: Re: submission of patches for htdig
> Tim Frost wrote:
> >
> > Andrew,
> >
> > There are problems with htdig handling of quoted URLs, for which I have
> > produced a possible fix. I have that fix as a context diff against the
> > 3.0.8b2 source tree. Should I send that patch to you, or to the htdig
> > mailing list, or is there an official bug-reports/bug-fixes address to
> > which I should send it?
> >
> > Tim
>
> Please send it to me. I'll put it into the main source tree.
> Thanks!
>
>
> --
> Andrew Scherpbier <andrew@contigo.com>
> Contigo Software <http://www.contigo.com/>
>-- End of excerpt from Andrew Scherpbier

-- 
Tim Frost, Systems Engineer         Email: Tim.Frost@nz.eds.com
EDS (NZ) Ltd,                       Voice: +64 4 495-0504
P.O. Box 3647,                      Fax:   +64 4 495-0473
Wellington, New Zealand.

diff -u htdig-3.0.8b2/htdig/HTML.cc-orig htdig-3.0.8b2/htdig/HTML.cc
--- htdig-3.0.8b2/htdig/HTML.cc-orig Sun Dec 7 22:14:40 1997
+++ htdig-3.0.8b2/htdig/HTML.cc Fri Jan 9 21:24:03 1998
@@ -309,7 +309,7 @@
 HTML::do_tag(Retriever &retriever, String &tag)
 {
     char *position = tag.get() + 1; // Skip the '<'
-    char *q;
+    char *q, *t;
     int which, length;
 
     while (isspace(*position))
@@ -358,12 +358,34 @@
             position++;
             while (isspace(*position))
                 position++;
-            if (*position == '"')
+            //
+            // Allow either single quotes or double quotes
+            // around the URL itself
+            //
+            if (*position == '"'||*position == '\'')
             {
                 position++;
-                q = strchr(position, '"');
+                q = strchr(position, position[-1]);
                 if (!q)
                     break;
+                //
+                // We seem to have matched the opening quote char
+                // Mark the end of the quotes as our endpoint, so
+                // that we can continue parsing after the current
+                // text
+                //
+                *q = '\0';
+                //
+                // If a '?' or '#' is present in a quoted URL,
+                // treat that as the end of the URL, but we skip
+                // past the quote to parse the rest of the anchor.
+                //
+                // Is there a better way of looking for these?
+                //
+                if ((t = strchr(position, '#')) != NULL)
+                    *t = '\0';
+                if ((t = strchr(position, '?')) != NULL)
+                    *t = '\0';
             }
             else
             {
@@ -374,8 +396,8 @@
                        *q != '?' &&
                        *q != '#')
                     q++;
+                *q = '\0';
             }
-            *q = '\0';
             delete href;
             href = new URL(position, *base);
             in_ref = 1;
@@ -396,20 +418,42 @@
             position++;
             while (isspace(*position))
                 position++;
-            if (*position == '"')
+            //
+            // Allow either single quotes or double quotes
+            // around the URL itself
+            //
+            if (*position == '"'||*position == '\'')
             {
                 position++;
-                q = strchr(position, '"');
+                q = strchr(position, position[-1]);
                 if (!q)
                     break;
+                //
+                // We seem to have matched the opening quote char
+                // Mark the end of the quotes as our endpoint, so
+                // that we can continue parsing after the current
+                // text
+                //
+                *q = '\0';
+                //
+                // If a '?' or '#' is present in a quoted URL,
+                // treat that as the end of the URL, but we skip
+                // past the quote to parse the rest of the anchor.
+                //
+                // Is there a better way of looking for these?
+                //
+                if ((t = strchr(position, '#')) != NULL)
+                    *t = '\0';
+                if ((t = strchr(position, '?')) != NULL)
+                    *t = '\0';
             }
             else
             {
                 q = position;
                 while (*q && *q != '>' && !isspace(*q))
                     q++;
+                *q = '\0';
             }
-            *q = '\0';
             retriever.got_anchor(position);
             position = q + 1;
             break;
@@ -484,20 +528,42 @@
             position++;
             while (isspace(*position))
                 position++;
-            if (*position == '"')
+            //
+            // Allow either single quotes or double quotes
+            // around the URL itself
+            //
+            if (*position == '"'||*position == '\'')
             {
                 position++;
-                q = strchr(position, '"');
+                q = strchr(position, position[-1]);
                 if (!q)
                     break;
+                //
+                // We seem to have matched the opening quote char
+                // Mark the end of the quotes as our endpoint, so
+                // that we can continue parsing after the current
+                // text
+                //
+                *q = '\0';
+                //
+                // If a '?' or '#' is present in a quoted URL,
+                // treat that as the end of the URL, but we skip
+                // past the quote to parse the rest of the anchor.
+                //
+                // Is there a better way of looking for these?
+                //
+                if ((t = strchr(position, '#')) != NULL)
+                    *t = '\0';
+                if ((t = strchr(position, '?')) != NULL)
+                    *t = '\0';
             }
             else
             {
                 q = position;
                 while (*q && *q != '>' && !isspace(*q))
                     q++;
+                *q = '\0';
             }
-            *q = '\0';
             retriever.got_image(position);
             break;
         }
@@ -616,12 +682,34 @@
             position++;
             while (isspace(*position))
                 position++;
-            if (*position == '"')
+            //
+            // Allow either single quotes or double quotes
+            // around the URL itself
+            //
+            if (*position == '"'||*position == '\'')
             {
                 position++;
-                q = strchr(position, '"');
+                q = strchr(position, position[-1]);
                 if (!q)
                     break;
+                //
+                // We seem to have matched the opening quote char
+                // Mark the end of the quotes as our endpoint, so
+                // that we can continue parsing after the current
+                // text
+                //
+                *q = '\0';
+                //
+                // If a '?' or '#' is present in a quoted URL,
+                // treat that as the end of the URL, but we skip
+                // past the quote to parse the rest of the anchor.
+                //
+                // Is there a better way of looking for these?
+                //
+                if ((t = strchr(position, '#')) != NULL)
+                    *t = '\0';
+                if ((t = strchr(position, '?')) != NULL)
+                    *t = '\0';
             }
             else
             {
@@ -632,8 +720,8 @@
                        *q != '?' &&
                        *q != '#')
                     q++;
+                *q = '\0';
             }
-            *q = '\0';
             delete href;
             href = new URL(position, *base);
             if (doindex)
@@ -668,12 +756,34 @@
             position++;
             while (isspace(*position))
                 position++;
-            if (*position == '"')
+            //
+            // Allow either single quotes or double quotes
+            // around the URL itself
+            //
+            if (*position == '"'||*position == '\'')
             {
                 position++;
-                q = strchr(position, '"');
+                q = strchr(position, position[-1]);
                 if (!q)
                     break;
+                //
+                // We seem to have matched the opening quote char
+                // Mark the end of the quotes as our endpoint, so
+                // that we can continue parsing after the current
+                // text
+                //
+                *q = '\0';
+                //
+                // If a '?' or '#' is present in a quoted URL,
+                // treat that as the end of the URL, but we skip
+                // past the quote to parse the rest of the anchor.
+                //
+                // Is there a better way of looking for these?
+                //
+                if ((t = strchr(position, '#')) != NULL)
+                    *t = '\0';
+                if ((t = strchr(position, '?')) != NULL)
+                    *t = '\0';
             }
             else
             {
@@ -684,8 +794,8 @@
                        *q != '?' &&
                        *q != '#')
                     q++;
+                *q = '\0';
             }
-            *q = '\0';
             delete href;
             href = new URL(position, *base);
             if (doindex)
@@ -719,12 +829,34 @@
             position++;
             while (isspace(*position))
                 position++;
-            if (*position == '"')
+            //
+            // Allow either single quotes or double quotes
+            // around the URL itself
+            //
+            if (*position == '"'||*position == '\'')
             {
                 position++;
-                q = strchr(position, '"');
+                q = strchr(position, position[-1]);
                 if (!q)
                     break;
+                //
+                // We seem to have matched the opening quote char
+                // Mark the end of the quotes as our endpoint, so
+                // that we can continue parsing after the current
+                // text
+                //
+                *q = '\0';
+                //
+                // If a '?' or '#' is present in a quoted URL,
+                // treat that as the end of the URL, but we skip
+                // past the quote to parse the rest of the anchor.
+                //
+                // Is there a better way of looking for these?
+                //
+                if ((t = strchr(position, '#')) != NULL)
+                    *t = '\0';
+                if ((t = strchr(position, '?')) != NULL)
+                    *t = '\0';
             }
             else
             {
@@ -735,8 +867,8 @@
                        *q != '?' &&
                        *q != '#')
                     q++;
+                *q = '\0';
             }
-            *q = '\0';
             URL tempBase(position, *base);
             *base = tempBase;
         }

Andrew,

There are several problems with htdig's URL parsing, and they are in HTML.cc rather than in URL.cc. Some of the servers on our intranet produce URLs quoted with apostrophes (') rather than double quotes ("). These are NOT correctly handled by the parsing code in HTML.cc.

The code in HTML.cc also does not strip a '#' or '?', and whatever follows it, from a URL that is inside quote marks.

I have created a patch which appears to resolve most of the problems with URL formats.

In creating the patch, I found that there are a number of different tags where a URL needs to be parsed. I am not happy with the patch, because it is doing the same thing for each of those different tags, rather than putting the common code in one place.
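
As an illustration only, the repeated logic could perhaps be collected into a single helper along the following lines. This is a rough sketch and not part of the patch: the name `extract_url` and its `strip_query_and_fragment` flag are hypothetical, and it assumes the caller hands it a writable, NUL-terminated buffer positioned just after the `=` of the attribute, as the tag parser in HTML.cc appears to do.

```cpp
#include <assert.h>
#include <ctype.h>
#include <string.h>

// Hypothetical helper (not in htdig): isolate the attribute value that
// starts at `position`, editing the buffer in place as the patch does.
//
// On success `position` points at the NUL-terminated URL and the return
// value points just past the value, so the caller can keep parsing the tag.
// Returns NULL if an opening quote has no matching closing quote.
char *extract_url(char *&position, bool strip_query_and_fragment)
{
    assert(position != NULL);

    while (isspace((unsigned char) *position))
        position++;

    char *end;
    if (*position == '"' || *position == '\'')
    {
        // Accept either quote style; the closing quote must match the opener.
        char quote = *position++;
        end = strchr(position, quote);
        if (!end)
            return NULL;
    }
    else
    {
        // Unquoted value: it runs up to whitespace or the end of the tag.
        end = position;
        while (*end && *end != '>' && !isspace((unsigned char) *end))
            end++;
    }

    // Remember where parsing resumes, without stepping past the buffer end.
    char *next = *end ? end + 1 : end;
    *end = '\0';

    if (strip_query_and_fragment)
    {
        // Cut the URL at the first '#' or '?'; if neither is present this
        // just rewrites the terminating NUL.
        position[strcspn(position, "#?")] = '\0';
    }
    return next;
}
```

Each tag case would then reduce to one call followed by its own action, for example passing `position` to `retriever.got_anchor()` and continuing from the returned pointer. Whether '#' and '?' should be stripped for every tag is left as a flag here, since the unquoted branches in the current code do not all agree on that.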

On Dec 10, 16:28, Andrew Scherpbier wrote:
> Subject: Re: htdig: Duplicate files with unique URLs
>
> Geoff Hutchison wrote:
> >
> > >I think that should be done in htlib/URL.cc. There is already some code to
> > >deal with "/../" in URLs. Removal of "//" shouldn't be too hard.
> >
> > Wasn't there also talk of keeping a checksum or something for each file and
> > only keeping one copy? In this particular situation this seems very easy,
> > but I can think of plenty of other situations where it's not so easy to
> > detect a duplicate file with a unique url.
>
> Yes. That would be a very good solution.
>
> > While we're at it, though, I suggest some way of stripping off strings like
> > "?D-A" used by Apache's new (1.3) directory sorting feature. I think this
> > would probably go in htlib/URL.cc as well.
>
> I thought I was already stripping out URL parameters after a "?"...
>
> Anyway, attached is a diff against htlib/URL.{cc,h} that will remove the
> double slashes from the path. I tested it briefly and it seemed to work, but
> please test this some more!
>
> --
> Andrew Scherpbier <andrew@contigo.com>
> Contigo Software <http://www.contigo.com/>
>
>-- End of excerpt from Andrew Scherpbier



