Re: [htdig] Using htdig to "tidy" HTML


Subject: Re: [htdig] Using htdig to "tidy" HTML
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Mon Jun 05 2000 - 12:16:16 PDT


On Mon, 5 Jun 2000, Rzepa, Henry wrote:

> retrieved by htdig from the start_url directory. This can be done simply
> using Dave Raggett's program Tidy, which seems pretty reliable (if not
> always 100%). However, invoking Tidy seems to require it be defined
> in conjunction with an external parser for the MIME type text/html.
> This means entirely over-riding the internal text/html htdig parser.

Alas, the problem here is that using the external converter feature this
way would produce an infinite loop. Setting text/html -> text/html would
just hand the output back to the same converter again. :-( [The chaining is
deliberate: you might have a converter to gunzip files which then produces
PDF files to go to another converter.]

I guess the ExternalParser code could be changed so that a converter
producing text/plain or text/html (or any future internal mime-types)
passes its output to the internal parser rather than looking for yet
another converter.
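
To make the loop and the possible change concrete, here is a toy model in
Python. It is only a sketch of the dispatch idea, not htdig's actual
ExternalParser code; the names resolve_chain, INTERNAL_TYPES and
stop_at_internal are made up for illustration.

```python
# Toy model of converter dispatch. NOT htdig's real ExternalParser code;
# every name here is hypothetical and exists only to illustrate the idea.

# MIME types that the built-in parsers are assumed to handle.
INTERNAL_TYPES = {"text/plain", "text/html"}


def resolve_chain(mime_type, converters, stop_at_internal=True, max_hops=10):
    """Return the list of MIME types a document passes through.

    converters maps an input MIME type to the type its converter emits.
    With stop_at_internal=True, the output of a converter is handed to the
    internal parser as soon as it is an internal type (the proposed change).
    """
    chain = [mime_type]
    converted = False
    for _ in range(max_hops):
        current = chain[-1]
        # Proposed rule: converter output of an internal type ends the chain.
        if converted and stop_at_internal and current in INTERNAL_TYPES:
            return chain
        if current not in converters:
            return chain  # no converter registered for this type
        chain.append(converters[current])
        converted = True
    raise RuntimeError("converter loop: " + " -> ".join(chain))


tidy_only = {"text/html": "text/html"}
print(resolve_chain("text/html", tidy_only))
```

With stop_at_internal=False the same call never settles and raises
RuntimeError, which is the infinite loop described above. A gunzip -> PDF
-> HTML chain resolves the same way under either rule.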

That said, I thought there was some sort of command-line tool to "spider"
with Tidy already. Maybe that was something dreamed up by one of my
friends at school. Still, it seems like a shell script around Tidy would
be better.
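
A wrapper like that does not have to be a shell script; a few lines of
Python would do the same job. The sketch below assumes a tidy binary on
the PATH and uses its -q (quiet) and -m (modify file in place) options.
The function name tidy_tree and the suffix list are my own choices, and it
walks a local directory rather than spidering over HTTP.

```python
import os
import subprocess


def tidy_tree(root, tidy_cmd=("tidy", "-q", "-m"), suffixes=(".html", ".htm")):
    """Run tidy_cmd on every HTML file under root.

    Returns a dict mapping each processed path to the command's exit status.
    tidy_cmd is a parameter so a different cleaner (or a stub) can be used.
    """
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(suffixes):
                continue  # skip anything that does not look like HTML
            path = os.path.join(dirpath, name)
            # Tidy exits 0 when clean, 1 on warnings, 2 on errors.
            proc = subprocess.run(
                list(tidy_cmd) + [path],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            results[path] = proc.returncode
    return results
```

Calling tidy_tree("/path/to/docs") rewrites the files in place, so run it
on a copy first; files that come back with status 2 are the ones Tidy
could not repair.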

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



