URL parsing error (was Re: htdig: Duplicate files with unique URLs)


Tim Frost (Tim.Frost@nz.eds.com)
Wed, 14 Jan 1998 13:39:35 +0000


Andrew,

There are several problems with htdig's URL parsing, and they are in
HTML.cc rather than in URL.cc. Some of the servers on our intranet
serve pages whose URLs are quoted with apostrophes (') rather than
double quotes ("). The parsing code in HTML.cc does NOT handle these
correctly.

The code in HTML.cc also does not strip the '#' or '?' part of a URL
when the URL is inside quote marks, as it does for unquoted URLs in
several of the tags.

I have created a patch which appears to resolve most of the problems
with URL formats.

In creating the patch, I found that there are a number of different
tags where a URL needs to be parsed. I am not happy with the patch,
because it is doing the same thing for each of those different tags,
rather than putting the common code in one place.
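
As a rough illustration of what "the common code in one place" could
look like, here is a minimal standalone sketch. It is not the patch
itself: extract_url and its parameters are hypothetical names, and it
uses std::string rather than htdig's own classes. It accepts either
quote character and cuts the URL at the first '#' or '?'. Because a
helper that takes the pointer by value cannot move the caller's pointer
past leading whitespace or the opening quote, the sketch hands back both
the cleaned URL and the position where parsing should resume.

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical helper (not part of htdig). `p` points just past the '='
// of an attribute such as href= or src=. On success, `url` receives the
// value without quotes, fragment or query, and `rest` points at the first
// character after the attribute value. Returns false if an opening quote
// is never closed.
bool extract_url(const char *p, std::string &url, const char *&rest)
{
    while (std::isspace(static_cast<unsigned char>(*p)))
        ++p;

    const char *end;
    if (*p == '\'' || *p == '"') {
        const char quote = *p++;        // remember which quote opened the value
        end = std::strchr(p, quote);    // the same character must close it
        if (!end)
            return false;
        rest = end + 1;                 // resume after the closing quote
    } else {
        end = p;
        while (*end && *end != '>' &&
               !std::isspace(static_cast<unsigned char>(*end)))
            ++end;
        rest = end;                     // resume at the delimiter itself
    }

    // Keep only the part before the first '#' or '?'.
    url.assign(p, end);
    const std::string::size_type cut = url.find_first_of("#?");
    if (cut != std::string::npos)
        url.erase(cut);
    return true;
}
```

Each tag handler would then make a single call and use the returned
string, instead of repeating the quote and delimiter logic.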

On Dec 10, 16:28, Andrew Scherpbier wrote:
> Subject: Re: htdig: Duplicate files with unique URLs
>
> Geoff Hutchison wrote:
> >
> > >I think that should be done in htlib/URL.cc. There is already some code to
> > >deal with "/../" in URLs. Removal of "//" shouldn't be too hard.
> >
> > Wasn't there also talk of keeping a checksum or something for each file and
> > only keeping one copy? In this particular situation this seems very easy,
> > but I can think of plenty of other situations where it's not so easy to
> > detect a duplicate file with a unique url.
>
> Yes. That would be a very good solution.
>
> > While we're at it, though, I suggest some way of stripping off strings like
> > "?D-A" used by Apache's new (1.3) directory sorting feature. I think this
> > would probably go in htlib/URL.cc as well.
>
> I thought I was already stripping out URL parameters after a "?"...
>
> Anyway, attached is a diff against htlib/URL.{cc,h} that will remove the
> double slashes from the path. I tested it briefly and it seemed to work, but
> please test this some more!
>
> --
> Andrew Scherpbier <andrew@contigo.com>
> Contigo Software <http://www.contigo.com/>
>
>-- End of excerpt from Andrew Scherpbier
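
The double-slash removal mentioned in the excerpt above might look
something like the following. This is only a sketch and is not Andrew's
diff, which is not reproduced here: collapse_slashes is a hypothetical
name, it works on std::string rather than htdig's URL class, and it
assumes the only "//" worth keeping is the one right after the scheme.
It makes no special case for a query string, so a "//" after a '?'
would be collapsed too.

```cpp
#include <string>

// Hypothetical helper (not part of htdig): collapse runs of '/' in a URL,
// leaving the "//" that follows the scheme untouched.
std::string collapse_slashes(const std::string &url)
{
    // Skip past "scheme://" if present, so those slashes are preserved.
    std::string::size_type start = 0;
    const std::string::size_type scheme = url.find("://");
    if (scheme != std::string::npos)
        start = scheme + 3;

    std::string out(url, 0, start);
    bool prev_slash = false;
    for (std::string::size_type i = start; i < url.size(); ++i) {
        const char c = url[i];
        if (c == '/' && prev_slash)
            continue;                   // drop the repeated slash
        prev_slash = (c == '/');
        out += c;
    }
    return out;
}
```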

-- 
Tim Frost, Systems Engineer         Email: Tim.Frost@nz.eds.com
EDS (NZ) Ltd,                       Voice: +64 4 495-0504
P.O. Box 3647,                      Fax:   +64 4 495-0473
Wellington, New Zealand.

diff -ru htdig-3.0.8b2-orig/htdig/HTML.cc htdig-3.0.8b2/htdig/HTML.cc
--- htdig-3.0.8b2-orig/htdig/HTML.cc Sun Dec 7 22:14:40 1997
+++ htdig-3.0.8b2/htdig/HTML.cc Thu Dec 18 21:02:03 1997
@@ -356,26 +356,8 @@
     if (!*position)
         return;
     position++;
-    while (isspace(*position))
-        position++;
-    if (*position == '"')
-    {
-        position++;
-        q = strchr(position, '"');
-        if (!q)
-            break;
-    }
-    else
-    {
-        q = position;
-        while (*q &&
-               *q != '>' &&
-               !isspace(*q) &&
-               *q != '?' &&
-               *q != '#')
-            q++;
-    }
-    *q = '\0';
+    if (!findend(position) )
+        break;
     delete href;
     href = new URL(position, *base);
     in_ref = 1;
@@ -396,20 +378,8 @@
     position++;
     while (isspace(*position))
         position++;
-    if (*position == '"')
-    {
-        position++;
-        q = strchr(position, '"');
-        if (!q)
-            break;
-    }
-    else
-    {
-        q = position;
-        while (*q && *q != '>' && !isspace(*q))
-            q++;
-    }
-    *q = '\0';
+    if (!findend(position))
+        break;
     retriever.got_anchor(position);
     position = q + 1;
     break;
@@ -484,20 +454,8 @@
     position++;
     while (isspace(*position))
         position++;
-    if (*position == '"')
-    {
-        position++;
-        q = strchr(position, '"');
-        if (!q)
-            break;
-    }
-    else
-    {
-        q = position;
-        while (*q && *q != '>' && !isspace(*q))
-            q++;
-    }
-    *q = '\0';
+    if (!findend(position) )
+        break;
     retriever.got_image(position);
     break;
 }
@@ -616,24 +574,8 @@
     position++;
     while (isspace(*position))
         position++;
-    if (*position == '"')
-    {
-        position++;
-        q = strchr(position, '"');
-        if (!q)
-            break;
-    }
-    else
-    {
-        q = position;
-        while (*q &&
-               *q != '>' &&
-               !isspace(*q) &&
-               *q != '?' &&
-               *q != '#')
-            q++;
-    }
-    *q = '\0';
+    if (!findend(position) )
+        break;
     delete href;
     href = new URL(position, *base);
     if (doindex)
@@ -668,24 +610,8 @@
     position++;
     while (isspace(*position))
         position++;
-    if (*position == '"')
-    {
-        position++;
-        q = strchr(position, '"');
-        if (!q)
-            break;
-    }
-    else
-    {
-        q = position;
-        while (*q &&
-               *q != '>' &&
-               !isspace(*q) &&
-               *q != '?' &&
-               *q != '#')
-            q++;
-    }
-    *q = '\0';
+    if (!findend(position) )
+        break;
     delete href;
     href = new URL(position, *base);
     if (doindex)
@@ -719,24 +645,8 @@
     position++;
     while (isspace(*position))
         position++;
-    if (*position == '"')
-    {
-        position++;
-        q = strchr(position, '"');
-        if (!q)
-            break;
-    }
-    else
-    {
-        q = position;
-        while (*q &&
-               *q != '>' &&
-               !isspace(*q) &&
-               *q != '?' &&
-               *q != '#')
-            q++;
-    }
-    *q = '\0';
+    if (!findend(position) )
+        break;
     URL tempBase(position, *base);
     *base = tempBase;
 }
@@ -746,4 +656,52 @@
     default:
         return; // Nothing...
     }
+}
+
+//*****************************************************************************
+// char *HTML::findend( char* position)
+//
+char *HTML::findend( char* position)
+{
+    char *q, *t;
+    while (isspace(*position))
+        position++;
+    //
+    // Allow either single quotes or double quotes
+    // around the URL itself
+    //
+    if (*position == '\'' || *position == '"')
+    {
+        position++;
+        q = strchr(position, position[-1]); // Match start
+        if (!q)
+            return(NULL);
+        //
+        // We seem to have matched the opening quote char
+        //
+        *q = '\0';
+        //
+        // If a '?' or '#' is present in a quoted URL,
+        // treat that as the end of the URL, but we skip
+        // past the quote to parse the rest of the anchor.
+        //
+        // Is there a better way of looking for these?
+        //
+        if ((t = strchr(position, '#')) != NULL)
+            *t = '\0';
+        if ((t = strchr(position, '?')) != NULL)
+            *t = '\0';
+    }
+    else
+    {
+        q = position;
+        while (*q &&
+               *q != '>' &&
+               !isspace(*q) &&
+               *q != '?' &&
+               *q != '#')
+            q++;
+        *q = '\0';
+    }
+    return q;
 }
diff -ru htdig-3.0.8b2-orig/htdig/HTML.h htdig-3.0.8b2/htdig/HTML.h
--- htdig-3.0.8b2-orig/htdig/HTML.h Sun Dec 7 22:14:40 1997
+++ htdig-3.0.8b2/htdig/HTML.h Thu Dec 18 20:54:57 1997
@@ -52,6 +52,7 @@
     // Helper functions
     //
     void do_tag(Retriever &, String &);
+    char *findend(char *);
 };
 
 #endif
