Re: htdig: htdig, muffin and javascript

Andrew Scherpbier
Thu, 17 Sep 1998 13:34:22 -0700

Colin Viebrock wrote:
> Thus spake Geoff Hutchison (at 01:37 PM 9/17/98 -0400) ...
> >I guess the "problem" is this: ht://Dig interprets JavaScript in HTML
> >files as text. So if we can take the code Muffin uses to strip JavaScript
> >and add it to a "remove JavaScript" pass over the HTML files before
> >ht://Dig begins the real indexing, we'd be set.
> What about the "problem" of people using JS to pop up windows and other
> URLs and such? If you simply strip all the JS code from a document, you'll
> lose these links (and the info in them).

And your problem with this is.... :-) (Did I mention I don't like

> And I haven't even mentioned JS that creates URL references on the fly, or
> based on other variables. Good luck coding a parser for that!

Exactly. This is definitely non-trivial.
For this reason there is not a single search engine that I know of that will
find any pages at except the front page...

> The only complete solution I can see is to write a program that emulates a
> browser and follows every possible link, button, image map, etc. possible
> from that page.

There is that GPL'd javascript interpreter... Believe me, I've thought about

> [or do the digging on the server side ... but then what URL do you present
> to the user?]

Just say "no" to javascript. :-)

P.S.: The best part of all this javascript stuff is that marketing normally
wants all the fancy stuff on their web pages but they *also* want all their
pages to be found by all the search engines. Try to explain that to them.
(What? Me bitter? Ha!)

Andrew Scherpbier
Contigo Software
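
For illustration only, here is a minimal sketch of the "remove JavaScript" pass discussed in this thread, together with a naive attempt to keep the links that stripping would otherwise lose. It is written in Python and is not code from Muffin or ht://Dig; the function name strip_javascript and both regular expressions are hypothetical, and the sketch assumes that scripts appear only inside <script> elements and that interesting URLs are plain quoted literals starting with "http://", "https://" or "/".

```python
import re

# Matches a whole <script ...>...</script> element, case-insensitively,
# including bodies that span several lines. Group 1 is the script body.
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>(.*?)</script\s*>", re.IGNORECASE | re.DOTALL
)

# Assumption for this sketch: a "link" inside script code is a quoted string
# literal that starts with http://, https:// or a slash.
URL_LITERAL_RE = re.compile(r"""["']((?:https?://|/)[^"'\s]+)["']""")


def strip_javascript(html):
    """Return (html_without_scripts, url_literals_found_in_scripts)."""
    urls = []

    def _collect(match):
        # Harvest URL-looking literals from the script body before dropping it.
        urls.extend(URL_LITERAL_RE.findall(match.group(1)))
        # Leave a space so the words on either side do not run together.
        return " "

    cleaned = SCRIPT_RE.sub(_collect, html)
    return cleaned, urls


if __name__ == "__main__":
    page = (
        '<p>Hello</p><SCRIPT language="JavaScript">\n'
        'window.open("/popup.html");\n'
        'var u = "page" + n + ".html";\n'
        "</SCRIPT><p>World</p>"
    )
    cleaned, urls = strip_javascript(page)
    print(cleaned)
    print(urls)
```

The literal "/popup.html" is picked up, but the URL assembled from "page" + n + ".html" is not. That is exactly the on-the-fly case Colin raises, and no amount of pattern matching will recover it without actually running the script.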
