Re: htdig: htdig, muffin and javascript


Andrew Scherpbier (andrew@contigo.com)
Thu, 17 Sep 1998 13:34:22 -0700


Colin Viebrock wrote:
>
> Thus spake Geoff Hutchison (at 01:37 PM 9/17/98 -0400) ...
> >I guess the "problem" is this: ht://Dig interprets JavaScript in HTML
> >files as text. So if we can take the code Muffin uses to strip JavaScript
> >and add it to a "remove JavaScript" pass over the HTML files before
> >ht://Dig begins the real indexing, we'd be set.
>
> What about the "problem" of people using JS to pop up windows and other
> URLs and such? If you simply strip all the JS code from a document, you'll
> lose these links (and the info in them).

And your problem with this is.... :-) (Did I mention I don't like
Javascript?)

> And I haven't even mentioned JS that creates URL references on the fly, or
> based on other variables. Good luck coding a parser for that!

Exactly. This is definitely non-trivial.
For this reason there is not a single search engine that I know of that will
find any pages at http://www.htmlguru.com/ except the front page...
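
To make the trade-off in this thread concrete, here is a rough sketch in Python. It is not ht://Dig or Muffin code; the names strip_scripts and literal_urls_in_scripts, the regular expressions, and the sample page are all made up for illustration. It shows a "remove JavaScript" pass of the kind proposed above, plus a naive attempt to rescue quoted URLs from the script before discarding it.

```python
import re

# Hypothetical helpers for illustration; not part of ht://Dig or Muffin.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                          re.IGNORECASE | re.DOTALL)
# Quoted string literals that look like absolute or root-relative URLs.
URL_LITERAL = re.compile(r"""["']((?:https?://|/)[^"']*)["']""")


def strip_scripts(html):
    """Remove <script>...</script> blocks so their code is not indexed as text."""
    return SCRIPT_BLOCK.sub("", html)


def literal_urls_in_scripts(html):
    """Naively collect URL-looking string literals found inside script blocks."""
    urls = []
    for block in SCRIPT_BLOCK.findall(html):
        urls.extend(URL_LITERAL.findall(block))
    return urls


# Made-up sample page: one literal pop-up URL, one URL built at run time.
page = """<html><body>
<p>Welcome</p>
<script language="JavaScript">
function popup() { window.open("http://www.example.com/news.html"); }
var section = "products";
location.href = "/" + section + "/index.html";
</script>
</body></html>"""

if __name__ == "__main__":
    print(strip_scripts(page))
    print(literal_urls_in_scripts(page))
```

The stripped page keeps "Welcome" and loses the script text, which fixes the indexing noise. The literal pop-up URL can be recovered by pattern matching, but the URL assembled from the section variable only shows up as fragments, never as /products/index.html. Getting that one right means actually executing the script.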

> The only complete solution I can see is to write a program that emulates a
> browser and follows every possible link, button, image map, etc. possible
> from that page.

There is that GPL'd javascript interpreter... Believe me, I've thought about
it...

> [or do the digging on the server side ... but then what URL do you present
> to the user?]

Yup.
Just say "no" to javascript. :-)

P.S.: The best part of all this javascript stuff is that marketing normally
wants all the fancy stuff on their web pages but they *also* want all their
pages to be found by all the search engines. Try to explain that to them.
(What? Me bitter? Ha!)

-- 
Andrew Scherpbier <andrew@contigo.com>
Contigo Software <http://www.contigo.com/>


