Re: [htdig] External parsers: VRML added: Following <embed> tags?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Fri, 4 Jun 1999 14:23:46 -0500 (CDT)


According to Geoff Hutchison:
> On Fri, 4 Jun 1999, Rzepa, Henry wrote:
>
> > If anyone can give us some hints as to how to modify htdig to follow
> > <embed> as well as <a> tags, we would be most grateful!!
>
> This patch has not even been tested to see if it compiles. But it should
> do what you ask.
..
> *************** HTML::do_tag(Retriever &retriever, Strin
> *** 1097,1102 ****
> --- 1097,1202 ----
> break;
> }
>
> + case 24: // embed
..

As far as I can tell, case 24 is identical to case 25, so the two can be
merged under a single body. Why duplicate code?
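
A minimal sketch of the merge, outside of ht://Dig: two case labels stacked
on one body. The enum names and the handle_tag() function are hypothetical
and only mirror the numbers 24 and 25 used in the patch.

```cpp
#include <string>

// Hypothetical tag codes mirroring "case 24: // embed" and
// "case 25: // object" from the patch.
enum TagCode { TAG_EMBED = 24, TAG_OBJECT = 25 };

// Returns a label naming the branch that ran, so the shared body is visible.
std::string handle_tag(int which_tag)
{
    switch (which_tag)
    {
    case TAG_EMBED:   // embed
    case TAG_OBJECT:  // object: shares the body below, nothing duplicated
        return "follow-url";

    default:
        return "ignored"; // Nothing...
    }
}
```

Both handle_tag(24) and handle_tag(25) take the same branch, so a later fix
to the URL handling only has to be made once.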

> +     case 25: // object
> +     {
> +         which = -1;
> +         int pos = attrs.FindFirstWord(position, which, length);

This will match any of "src", "href" or "name". Is this all right?
If the <embed> and <object> tags both use only src=..., you could use
srcMatch.FindFirstWord(...) instead.
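
To illustrate the difference, here is a rough standalone sketch. It does
NOT use ht://Dig's StringMatch class; find_first_word() is a hypothetical
stand-in that returns the offset of the first whole-word match and reports
which pattern matched. It assumes the three-word set is ordered "src",
"href", "name", so that index 0 means "src" as in the patch's
"which != 0" test.

```cpp
#include <cctype>
#include <string>
#include <vector>

static bool is_word_char(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical stand-in for a multi-pattern word matcher (not the real
// StringMatch API). Returns the offset of the first case-insensitive,
// whole-word match of any pattern, or -1 if none is found. On success,
// `which` is the index of the matching pattern and `length` its length.
int find_first_word(const char *text, const std::vector<std::string> &patterns,
                    int &which, int &length)
{
    for (int i = 0; text[i] != '\0'; i++)
    {
        if (i > 0 && is_word_char(text[i - 1]))
            continue;               // not at the start of a word
        for (std::size_t p = 0; p < patterns.size(); p++)
        {
            const std::string &pat = patterns[p];
            std::size_t n = 0;
            // Stops at the first mismatch, including the terminating '\0'.
            while (n < pat.size() &&
                   std::tolower(static_cast<unsigned char>(text[i + n])) ==
                   std::tolower(static_cast<unsigned char>(pat[n])))
                n++;
            if (n > 0 && n == pat.size() && !is_word_char(text[i + n]))
            {
                which = static_cast<int>(p);
                length = static_cast<int>(n);
                return i;
            }
        }
    }
    return -1;
}
```

With the tag text `embed name="demo" src="world.wrl"`, the three-word set
matches "name" first (which == 2), so a "which != 0" check would give up
before src= is ever seen. A matcher holding only "src" skips straight to
the src attribute.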

> +         if (pos < 0 || which != 0)
> +             break;
> +         position += pos + length;
> +         while (*position && *position != '=')
> +             position++;
> +         if (!*position)
> +             break;
> +         position++;
> +         while (isspace(*position))
> +             position++;
> +         //
> +         // Allow either single quotes or double quotes
> +         // around the URL itself
> +         //
> +         if (*position == '"'||*position == '\'')
> +         {
> +             position++;
> +             q = strchr(position, position[-1]);
> +             if (!q)
> +                 break;
> +             //
> +             // We seem to have matched the opening quote char
> +             // Mark the end of the quotes as our endpoint, so
> +             // that we can continue parsing after the current
> +             // text
> +             //
> +             *q = '\0';
> +             //
> +             // If a '#' is present in a quoted URL,
> +             // treat that as the end of the URL, but we skip
> +             // past the quote to parse the rest of the anchor.
> +             //
> +             if ((t = strchr(position, '#')) != NULL)
> +                 *t = '\0';
> +         }
> +         else
> +         {
> +             q = position;
> +             while (*q && *q != '>' && !isspace(*q))
> +                 q++;
> +             *q = '\0';
> +         }
> +         retriever.got_href(position);

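This last function call won't work as written: got_href() has to be given
a URL object, resolved against the base URL, along with a description
string. You'd need to do something like: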

            if (dofollow)
            {
                URL *href = new URL(position, *base);
                retriever.got_href(*href, "");
                delete href;
            }
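
Since the URL is deleted right after the call, an automatic (stack) object
would presumably do the same job without new/delete. The sketch below shows
that shape with hypothetical stand-in classes; the real ht://Dig URL and
Retriever classes differ, and the URL resolution here only handles the two
simplest cases.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a URL class with a (reference, parent)
// constructor, as used in the snippet above. Not the real ht://Dig class.
class URL
{
public:
    explicit URL(const std::string &absolute) : text(absolute) {}

    // Resolve `ref` against `parent`. Simplified: a ref containing "://"
    // is taken as absolute; anything else is appended to the parent's
    // directory.
    URL(const std::string &ref, const URL &parent)
    {
        if (ref.find("://") != std::string::npos)
            text = ref;
        else
        {
            std::string::size_type slash = parent.text.rfind('/');
            text = parent.text.substr(0, slash + 1) + ref;
        }
    }

    const std::string &get() const { return text; }

private:
    std::string text;
};

// Hypothetical stand-in that just records the URLs it is handed.
struct Retriever
{
    std::vector<std::string> seen;

    void got_href(const URL &url, const char *description)
    {
        (void)description;          // unused in this sketch
        seen.push_back(url.get());
    }
};

void follow_link(Retriever &retriever, const URL &base,
                 const char *position, bool dofollow)
{
    if (dofollow)
    {
        URL href(position, base);   // automatic storage, no delete needed
        retriever.got_href(href, "");
    }
}
```

This is only equivalent if got_href() does not keep a pointer to the URL
after it returns, which the immediate delete above suggests is the case.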

> +         break;
> +     }
> +
>       default:
>           return; // Nothing...
>   }
>
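
The value-extraction steps in the patch (skip to '=', skip whitespace,
accept either quote style, cut at '#') can be tried in isolation with
something like the sketch below. extract_attr_value() is a hypothetical
helper, not ht://Dig code. Unlike the patch, it works on a copy instead of
writing '\0' into the buffer, and it strips the '#' fragment in the
unquoted case as well.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: given the text that follows an attribute name,
// e.g. ` = "world.wrl#top" width=100`, return the URL value with any
// #fragment removed, or "" if no usable value is found.
std::string extract_attr_value(const std::string &rest)
{
    std::string::size_type i = rest.find('=');
    if (i == std::string::npos)
        return "";
    ++i;
    while (i < rest.size() &&
           std::isspace(static_cast<unsigned char>(rest[i])))
        ++i;
    if (i >= rest.size())
        return "";

    std::string value;
    if (rest[i] == '"' || rest[i] == '\'')
    {
        // Quoted value: find the matching closing quote character.
        std::string::size_type close = rest.find(rest[i], i + 1);
        if (close == std::string::npos)
            return "";              // unterminated quote, give up
        value = rest.substr(i + 1, close - i - 1);
    }
    else
    {
        // Unquoted value: runs until whitespace, '>' or end of text.
        std::string::size_type j = i;
        while (j < rest.size() && rest[j] != '>' &&
               !std::isspace(static_cast<unsigned char>(rest[j])))
            ++j;
        value = rest.substr(i, j - i);
    }

    // Treat '#' as the end of the URL.
    std::string::size_type hash = value.find('#');
    if (hash != std::string::npos)
        value.erase(hash);
    return value;
}
```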

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930