[htdig] A problem in the HTML parser of htdig 3.1.5


Subject: [htdig] A problem in the HTML parser of htdig 3.1.5
From: Russell Cox (Russell_Cox@zd.com)
Date: Fri Apr 07 2000 - 08:13:44 PDT


Hello

I've recently been trying out htdig 3.1.5 to index the contents of our site
and have come across a problem with htdig's HTML parser.

The problem occurs when an SSI call is placed inside an attribute value. In
the cases we've spotted, this has been in an image's SRC attribute value.
For example, one problem page contains:

<IMG
SRC="http://ad.uk.doubleclick.net/ad/www.gsbuttons.co.uk/jakarta-bargain;sz=88x31;ord=<!--#echo var="REMOTE PORT"-->?"
ALT="Bargains @Jakarta" BORDER=0 WIDTH=88 HEIGHT=31 HSPACE=0 VSPACE=0>

When the page is viewed in a browser, Apache has already processed the SSI,
so there is no problem. When we dig by file, however, the instruction is not
processed and htdig's parser makes a mistake: it closes the IMG tag at the
first > it finds, which is the one at the end of the SSI call. As a result,
the remainder of the actual image tag ends up in the text stored in the
database.
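
To illustrate the behaviour described above, here is a minimal Python sketch.
It is not htdig's code (htdig is written in C++); naive_split is a
hypothetical helper that simply closes a tag at the first > it sees, which is
an assumption about what the parser is effectively doing. The tag is the one
quoted in this message.

```python
# Hypothetical reproduction of the symptom; not taken from htdig's source.
TAG = ('<IMG SRC="http://ad.uk.doubleclick.net/ad/www.gsbuttons.co.uk/'
       'jakarta-bargain;sz=88x31;ord=<!--#echo var="REMOTE PORT"-->?" '
       'ALT="Bargains @Jakarta" BORDER=0 WIDTH=88 HEIGHT=31 HSPACE=0 VSPACE=0>')


def naive_split(html):
    """Close the tag at the first '>' and return (tag, leftover_text)."""
    end = html.index('>')
    return html[:end + 1], html[end + 1:]


tag, leftover = naive_split(TAG)
# 'tag' stops at the end of the SSI comment; 'leftover' holds the rest of
# the IMG tag, which a parser like this would treat as document text.
print(tag)
print(leftover)
```

With this input the split happens at the > of the SSI comment, so everything
from ?" ALT= onwards is left over as text.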

I've checked RFC 1866 (the HTML 2.0 spec) on this, and > is a valid
character within a quoted attribute value (admittedly the double quotes
around the REMOTE PORT are invalid, but they haven't caused a problem
here). We have a workaround ready for our site which will eliminate the
problem, but I thought it would be useful to report this issue with the
parser in case it hadn't been mentioned before (a few searches failed to
turn up anything specifically related).
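
As a rough sketch of what a quote-aware scan could look like, the Python
below skips any > that falls inside a quoted attribute value. find_tag_end is
a hypothetical function written for illustration, not a patch against htdig,
and it makes no attempt to handle comments or other markup declarations.

```python
# Hypothetical quote-aware scan; an illustration only, not htdig's parser.
TAG = ('<IMG SRC="http://ad.uk.doubleclick.net/ad/www.gsbuttons.co.uk/'
       'jakarta-bargain;sz=88x31;ord=<!--#echo var="REMOTE PORT"-->?" '
       'ALT="Bargains @Jakarta" BORDER=0 WIDTH=88 HEIGHT=31 HSPACE=0 VSPACE=0>')


def find_tag_end(html, start=0):
    """Return the index of the '>' closing the tag that opens at html[start].

    A '>' inside a single- or double-quoted attribute value is ignored.
    Returns -1 if the tag is never closed.
    """
    quote = None  # the quote character we are currently inside, if any
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote is not None:
            if ch == quote:
                quote = None          # closing quote of the current value
        elif ch in ('"', "'"):
            quote = ch                # opening quote of a new value
        elif ch == '>':
            return i                  # unquoted '>' ends the tag
    return -1


end = find_tag_end(TAG)
print(TAG[:end + 1] == TAG)
```

On the tag from our page the nested double quotes happen to pair up so that
the > of the SSI call still lies inside a quoted span, and the scan reaches
the real end of the IMG tag. That is a property of this particular input
rather than something the sketch guarantees for every malformed value.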

Regards

Russell Cox
Web Developer
ZDNet UK (www.zdnet.co.uk)



