htdig: htdig, muffin and javascript


Jed Michnowicz (jmichno@homer.providence.edu)
Sun, 13 Sep 1998 15:58:43 -0400


About three weeks ago I experimented with muffin to filter out javascript
from the search indexes (it did not work that well). Here is my experience:

It is known that htdig does not parse javascript properly. The search
summaries display the ugly javascript code instead of the document's true
summary. It has been suggested that muffin be used to filter out the
javascript junk.

I have installed muffin (http://muffin.doit.org) and it does filter out the
javascript, but from what I have found, muffin is more of a personal proxy
and does not work under a high load. When many requests are made to it,
muffin tends to return incorrect info, such as the wrong URL or the wrong
page data.

To get around this, I added a sleep statement in the document retriever
loop (Retriever.cc) so it would be forced to wait 1 second between
requests. Although very slow, the muffin/htdig combo worked until I
indexed a large site.
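
A minimal sketch of that kind of throttling, for illustration only. This
is not htdig's actual retriever code: the function name fetch_throttled,
the fetch callback, and the delay parameter are all made up here, and it
uses std::this_thread::sleep_for from modern C++ where a plain sleep(1)
call would have served at the time.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a crawler's fetch loop (not htdig's real code).
// Calls `fetch` once per URL, in order, and pauses for `delay` between
// consecutive requests so a fragile proxy is never hit back to back.
// Returns the number of requests made.
std::size_t fetch_throttled(
    const std::vector<std::string>& urls,
    const std::function<void(const std::string&)>& fetch,
    std::chrono::milliseconds delay) {
    std::size_t made = 0;
    for (const std::string& url : urls) {
        // No pause before the first request, only between requests.
        if (made > 0) {
            std::this_thread::sleep_for(delay);
        }
        fetch(url);
        ++made;
    }
    return made;
}
```

With a delay of one second, 40,000 documents mean at least 40,000 seconds
(roughly 11 hours) spent doing nothing but waiting, which is why this
approach is so slow on a large site.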

After indexing 40,000+ documents, muffin gave up. Muffin tends to eat
memory as it goes along and then just stops responding.

Basically, the muffin/htdig combo does not really work that well. I was
wondering if anybody knows of a better way to filter out javascript. If
so, could this be incorporated into htdig?
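
To make the question concrete, here is a rough sketch of one possible
filter: dropping everything between <script ...> and </script> before the
text is summarized. The function strip_script_blocks is hypothetical and
is not part of htdig or muffin. It is a plain string scan, not a real
HTML parser, so it assumes reasonably well-formed tags.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of htdig): returns a copy of `html` with
// every <script ...> ... </script> element removed. Matching is
// case-insensitive. An unterminated script element is dropped to the end
// of the input.
std::string strip_script_blocks(const std::string& html) {
    // Lowercased copy used only for searching; offsets match `html`.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    const std::string open_tag = "<script";
    std::string out;
    std::string::size_type pos = 0;
    while (pos < html.size()) {
        std::string::size_type open = lower.find(open_tag, pos);
        if (open == std::string::npos) {
            out.append(html, pos, std::string::npos);  // no more scripts
            break;
        }
        std::string::size_type after = open + open_tag.size();
        // Skip look-alike tags such as <scripted>: keep them verbatim.
        if (after < lower.size() &&
            std::isalnum(static_cast<unsigned char>(lower[after]))) {
            out.append(html, pos, after - pos);
            pos = after;
            continue;
        }
        out.append(html, pos, open - pos);  // keep text before the script
        std::string::size_type close = lower.find("</script", after);
        if (close == std::string::npos) break;  // unterminated: drop rest
        std::string::size_type end = lower.find('>', close);
        if (end == std::string::npos) break;
        pos = end + 1;  // resume right after </script>
    }
    return out;
}
```

A filter along these lines would run inside the indexer itself, so no
proxy would be needed between htdig and the web server.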

Any ideas?
-Jed Michnowicz

END this message NOW
Jed Michnowicz...A part time genius, but a full time moron.
jmichno {at} homer {dot} providence {dot} edu
http://studentweb.providence.edu/~jmichno


