Re: htdig: (Not) translating entities


Marjolein Katsma (webmaster@javawoman.com)
Tue, 12 Jan 1999 06:27:34 +0100


At 20:56 1999-01-11 -0500, Geoff Hutchison wrote:
>
>On Mon, 11 Jan 1999, Marjolein Katsma wrote:
>
>> Some digging revealed that both &lt; and &gt; are translated, and then '<'
>> is converted to a space... not what I needed.
>> For pages with code samples of such languages (HTML and other tag-based
>> languages) the automatic translation of such entities actually gets in the
>> way - so I made it configurable. Also useful for pages/sites with
>> mathematical formulae which should be recognizable in the excerpts.
>
>This was a kludge to side-step a nasty bug in the HTML parser. If we
>didn't remove the '<' it would call it the beginning of a tag and try to
>parse the tag. Not nice either.
>
>While your patch is nice, it also side-steps the issue in the HTML parser.
>One of these days someone needs to go back and figure out an optimal
>approach to its tasks--translate SGML entities, operate on tags, and form
>the excerpts. Right now we're doing them in that order, but this clearly
>causes problems.

Well, it only causes problems if you *do* translate &lt; and &gt;. But
these entities weren't invented for nothing: just as they prevent problems
with a browser (parse and display), they also prevent problems with other
programs like search engines (parse and index). IMO they should never be
translated; the same goes for '&amp;'. The only exception is '&quot;', which
is only really needed in attribute values.
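
To make the point concrete, here is a minimal Python sketch (not ht://Dig
code). The entity table, the PROTECTED set and the function names are all
made up for illustration, and the tag stripper is a crude stand-in for a real
HTML parser. It only shows what happens to a code sample when every entity is
translated before tags are handled, compared with leaving &lt;, &gt; and
&amp; alone.

```python
import re

# Hypothetical, tiny entity table; a real table would be much larger.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "eacute": "\u00e9"}

# Entities that guard markup characters and so are left untranslated.
PROTECTED = frozenset({"lt", "gt", "amp"})

ENTITY_RE = re.compile(r"&([A-Za-z]+);")
TAG_RE = re.compile(r"<[^>]*>")  # crude stand-in for a tag parser


def translate(text, protected=frozenset()):
    """Translate named entities, skipping protected or unknown names."""
    def repl(match):
        name = match.group(1)
        if name in protected or name not in ENTITIES:
            return match.group(0)  # keep the entity exactly as written
        return ENTITIES[name]
    return ENTITY_RE.sub(repl, text)


def strip_tags(text):
    """Replace anything that looks like a tag with a space."""
    return TAG_RE.sub(" ", text)


source = "<p>Use &lt;b&gt; for caf&eacute; menus</p>"

# Translate everything first: the sample tag becomes a real tag and is lost.
translate_all = strip_tags(translate(source)).split()

# Protect &lt; &gt; &amp;: the sample tag survives as text.
keep_markup = strip_tags(translate(source, PROTECTED)).split()
```

In the first case the "<b>" from the code sample is parsed as a tag and
disappears from the words; in the second it stays in the text as
"&lt;b&gt;" while other entities such as &eacute; are still translated.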

It seems to me that changing '<' to a space isn't the right way to solve
the problem caused by translating the entities in the first place. I only
left that in place to keep the program's default behavior unchanged. The
excerpt of my configuration file shows what I think should be the 'normal'
way to treat these entities.
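
As a rough illustration of that trade-off, the Python sketch below models a
switch whose default keeps the old behavior (translate, then turn '<' into a
space) and whose other setting leaves the entities untouched. The function
and the parameter name translate_markup_entities are hypothetical; they are
not the option names from my patch or from ht://Dig.

```python
def excerpt_text(text, translate_markup_entities=True):
    """Prepare excerpt text from a fragment whose tags are already removed.

    translate_markup_entities is a hypothetical switch:
      True  - legacy behavior: translate &lt; &gt; &amp;, then blank out '<'
      False - leave the three entities exactly as written
    """
    if not translate_markup_entities:
        return text
    # Translate &amp; last so that "&amp;lt;" is not decoded twice.
    text = text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")
    # Legacy workaround: a bare '<' would otherwise look like a tag start.
    return text.replace("<", " ")


sample = "if (a &lt; b) x = 1;"
legacy = excerpt_text(sample)
preserved = excerpt_text(sample, translate_markup_entities=False)
```

With the default, the comparison operator is gone from the excerpt; with the
switch turned off, the excerpt still contains "&lt;" and the formula stays
recognizable.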

>
>You also noted that there were some SGML equivalents not present in the
>current file. I'll gladly accept a patch for that (or if you're too busy,
>a URL to an appropriate reference). ;-)

Will do some digging ;-)

>
>Cheers,
>-Geoff Hutchison
>Williams Students Online
>http://wso.williams.edu/
>
>----------------------------------------------------------------------
>To unsubscribe from the htdig mailing list, send a message to
>htdig-request@sdsu.edu containing the single word "unsubscribe" in
>the body of the message.

Marjolein Katsma webmaster@javawoman.com
Java Woman - http://javawoman.com/



This archive was generated by hypermail 2.0b3 on Wed Jan 13 1999 - 09:13:05 PST