Gilles Detillieux (firstname.lastname@example.org)
Tue, 16 Mar 1999 13:47:41 -0600 (CST)
According to J. op den Brouw:
> as the title says, I had problems with comments in HTML, and the file
> What is the problem? Well, most web pages at our school are produced by
> people who don't know s**t about HTML.
> They produce comments like:
> <!hello -->
> This is
> done by finding the -- just before the > (comments end with -->). But in
> the first comment
> case above it fails. Anyway, it messes my indexing. The trick is (I
> HOPE) that line 161:
> q = (unsigned char*)strstr((char *)position, "--");
> should be changed to:
> q = (unsigned char*)strstr((char *)position, "-->");
> It finds the first occurrence of -->, so it doesn't recurse into comments. Anyway,
> it works on my htdig system.
This isn't quite right. We had a big discussion about this two weeks ago.
The HTML standard allows white space (even newlines) between the closing
"--" and ">" of a comment. The trick is to gobble up any extra dashes
after the first two, and then skip white space. If that doesn't leave
you at a ">", I think you have to start over again, scanning for the next "--".
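
A rough sketch of that scan in C might look like the following. This is not the actual ht://Dig code: the function name skip_comment is hypothetical, and it assumes "position" points somewhere inside the comment body, after the opening delimiter. It returns a pointer just past the closing ">", or NULL if the comment is never terminated.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the ht://Dig source): find the end of an
 * HTML comment, allowing extra dashes and white space before the '>'. */
static const char *skip_comment(const char *position)
{
    const char *q = position;

    while ((q = strstr(q, "--")) != NULL) {
        q += 2;                             /* step over the first two dashes */
        while (*q == '-')                   /* gobble up any extra dashes */
            q++;
        while (isspace((unsigned char)*q))  /* skip white space, even newlines */
            q++;
        if (*q == '>')
            return q + 1;                   /* comment ends here */
        /* Not a terminator: start over, scanning for the next "--". */
    }
    return NULL;                            /* unterminated comment */
}
```

With this, a body such as " a -- b -- > rest" is skipped up to " rest", where a plain strstr() for "-->" would miss the terminator because of the space before the ">".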
> Another problem is that M$ Frontpage 98 in combination with Frontpage
> Server Extension don't do
> <AREA> tags. They create a webbot (inside a comment). If the webbot has
> links, these links don't
> get indexed. Of course this is a M$ / user problem; it's just so that you know
> of it.
can enhance the HTML parser to deal with these webbot links reliably,
without breaking anything else, go for it. Otherwise, it'll remain a
problem, until M$ learns to adhere to standards other than their own. ;-)
--
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
firstname.lastname@example.org containing the single word "unsubscribe" in
the SUBJECT of the message.