Subject: Re: [htdig] Using htdig to "tidy" HTML
From: Geoff Hutchison (firstname.lastname@example.org)
Date: Mon Jun 05 2000 - 12:16:16 PDT
On Mon, 5 Jun 2000, Rzepa, Henry wrote:
> retrieved by htdig from the start_url directory. This can be done simply
> using Dave Raggett's program Tidy, which seems pretty reliable (if not
> always 100%). However, invoking Tidy seems to require it be defined
> in conjunction with an external parser for the MIME type text/html.
> This means entirely over-riding the internal text/html htdig parser.
Alas, the problem here is that invoking the external converter feature
would produce an infinite loop. Setting text/html -> text/html would just
call the converter again. :-( [The feature here is that you might have a
converter to gunzip files which then produces PDF files to go to another
I guess the ExternalParser code could be changed so that a converter
producing text/plain or text/html (or any future internal mime-types)
passes it off to the internal code.
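
To make that idea concrete, here is a rough sketch of the dispatch rule in Python. It is not the actual ExternalParser code (which is C++). The names plan, INTERNAL_TYPES, converters and max_hops are all hypothetical and chosen only for illustration. The point is that a converter whose output is text/html or text/plain ends the chain at the internal parser instead of being looked up again.

```python
# Hypothetical sketch of the proposed dispatch rule; none of these names
# come from ht://Dig itself.
INTERNAL_TYPES = {"text/html", "text/plain"}


def plan(mime_type, converters, internal_types=INTERNAL_TYPES, max_hops=10):
    """Return the processing steps for a document of the given MIME type.

    converters maps a source MIME type to the MIME type produced by its
    external converter. A converter that produces an internal type hands
    the result to the internal parser rather than being consulted again.
    """
    steps = []
    current = mime_type
    for _ in range(max_hops):
        target = converters.get(current)
        if target is None:
            # No external converter registered: only the internal
            # parsers can take it from here.
            if current in internal_types:
                steps.append(("internal", current))
                return steps
            raise ValueError("no parser for " + current)
        steps.append(("convert", current, target))
        if target in internal_types:
            # The proposed change: stop here instead of looking up a
            # converter for the output type again.
            steps.append(("internal", target))
            return steps
        current = target
    # Guard against converters that only ever feed each other.
    raise RuntimeError("converter chain too long")


# Tidy registered as text/html -> text/html runs once, then the
# internal HTML parser takes over.
tidy_steps = plan("text/html", {"text/html": "text/html"})

# The gunzip case: gzip -> PDF -> HTML -> internal parser.
gzip_steps = plan(
    "application/x-gzip",
    {"application/x-gzip": "application/pdf", "application/pdf": "text/html"},
)
```

The max_hops limit is an extra safety net assumed here for converters that produce only each other's input types; it is not part of the suggestion above.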
That said, I thought there was some sort of command-line tool to "spider"
with Tidy already. Maybe that was something dreamed up by one of my
friends at school. Still, it seems like a shell script around Tidy would
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/