Re: [htdig] A problem in the HTML parser of htdig 3.1.5


From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Fri Apr 07 2000 - 09:50:07 PDT


On Fri, 7 Apr 2000, Russell Cox wrote:

> BORDER=0 WIDTH=88 HEIGHT=31 HSPACE=0 VSPACE=0>. When viewed by a browser,
> Apache has processed the SSI, so it obviously isn't a problem, but when we
> dig by file, that instruction isn't processed and htdig's parser makes a
> mistake. The parser closes the image tag at the first > it finds, so it
> closes the IMG tag at the end of the SSI call. The result is that half of
> the actual image tag ends up in the text stored in the database.

First, this will only come up if you've set SSI to run in .html files.
The local file support is very picky about this for *exactly* this
reason--it doesn't parse dynamic content. (It *can't* really parse SSI,
because it would need to use Apache's code to do it properly.)

So if you have a lot of dynamic content, you really shouldn't be using
local file support--htdig won't index the pages as the user sees them.
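
To make the failure concrete, here is a minimal Python sketch of the two
ways a parser can look for the end of a tag. This is not htdig's actual
code; the function names and the sample tag are made up for illustration,
and the sample assumes the SSI directive uses single quotes inside a
double-quoted attribute value.

```python
def naive_tag_end(text, start=0):
    """Return the index of the first '>' at or after start, or -1."""
    return text.find(">", start)


def quote_aware_tag_end(text, start=0):
    """Return the index of the first '>' outside a quoted value, or -1."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(text)):
        ch = text[i]
        if quote:
            # Inside a quoted value: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == ">":
            return i
    return -1


# Hypothetical tag containing an unprocessed SSI directive.
sample = '''<IMG SRC="/count?<!--#echo var='NAME' -->" BORDER=0>text'''

# The naive scan stops at the end of the SSI comment, so the rest of the
# tag leaks into what would be indexed as text.
leaked = sample[naive_tag_end(sample) + 1:]

# The quote-aware scan skips the '>' inside the quoted SRC value.
clean = sample[quote_aware_tag_end(sample) + 1:]

print(repr(leaked))
print(repr(clean))
```

The first scan leaves part of the tag behind as document text, which is
the symptom described in the report; the second leaves only the text that
follows the tag.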

> I've checked RFC1866 for the HTML spec in this matter, and > is a valid
> character within a string attribute value (admittedly the double quotes
> around the REMOTE PORT are invalid, but they haven't caused a problem

That RFC is for version 2.0 of HTML. Checking the 4.01 spec, it doesn't
expressly forbid this, but it also says that character entities should be
interpreted as the characters they represent. So if you want a literal <
in an attribute value, you should write &lt; (and likewise &gt; for >).

At least that's my interpretation.
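
As a small illustration of the entity approach, here is a Python sketch
using the standard library's html module. The attribute value and the ALT
attribute are made-up examples, not taken from the page in question.

```python
import html

# Hypothetical attribute value containing characters that confuse
# a parser which ends the tag at the first '>'.
value = "1 < 2 > 0"

# html.escape turns &, < and > into entities; quote=True also escapes
# double and single quotes so the value is safe inside an attribute.
escaped = html.escape(value, quote=True)

tag = '<IMG ALT="{}">'.format(escaped)

# A conforming parser maps the entities back to the original characters.
restored = html.unescape(escaped)

print(tag)
print(restored)
```

With the value written this way, the only literal > in the tag is the one
that closes it, so even a parser that stops at the first > gets it right.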

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
