About three weeks ago I experimented with muffin to filter out javascript
from the search indexes (it did not work that well). Here is my experience:

It is known that htdig does not parse javascript properly. The search
summaries display the ugly javascript code instead of the document's true
summary. It has been suggested that muffin be used to filter out the
javascript junk.
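
To make the kind of filtering concrete, here is a minimal sketch in Python
of dropping everything inside <script> blocks before text is handed to an
indexer. This is not muffin or htdig code; the names ScriptStripper and
visible_text are made up for illustration, and it assumes the page is
reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Collect text content, skipping anything inside <script> elements.

    Hypothetical helper for illustration only; not part of htdig or muffin.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._script_depth = 0   # > 0 while inside a <script> element
        self.parts = []          # text fragments found outside scripts

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._script_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._script_depth:
            self._script_depth -= 1

    def handle_data(self, data):
        # Script bodies arrive here as plain data; drop them.
        if not self._script_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML page with script bodies removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result reads like a summary.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = ("<html><head><script>var x = 1; document.write(x);</script>"
            "<title>Demo</title></head><body><p>Hello world</p></body></html>")
    print(visible_text(page))
```

A filter like this runs once per document and keeps no state between
pages, so its memory use does not depend on how many documents are indexed.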

I have installed muffin ( and it does filter out the
javascript, but from what I have found, muffin is more of a personal proxy
and does not work under a high load. Muffin tends to return incorrect info
such as the wrong URL, or the wrong page data when many requests are made
to it.

To get around this, I added a sleep statement in the document retriever
loop ( so it would be forced to wait 1 second between
requests. Although very slow, the muffin/htdig combo worked until I
indexed a large site.
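
The idea of the workaround, shown as a rough Python sketch rather than the
actual change to htdig's retriever (which is C++): wait a fixed delay
between consecutive requests so the proxy only ever sees one at a time. The
function name fetch_throttled and its parameters are hypothetical, and the
fetch callable stands in for whatever actually retrieves a document.

```python
import time


def fetch_throttled(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between requests.

    `fetch` is any callable taking a URL and returning the document.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # no pause before the first request
        results.append((url, fetch(url)))
    return results


if __name__ == "__main__":
    # Stand-in fetcher for demonstration; a real one would go through the proxy.
    docs = fetch_throttled(["http://a/", "http://b/"],
                           fetch=lambda u: "body of " + u,
                           delay=1.0)
    for url, body in docs:
        print(url, "->", body)
```

At one second per request the delay alone adds more than 11 hours to a
40,000-document run (40,000 s is about 11.1 hours), before any fetch time.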

After indexing 40,000+ documents, muffin gave up. Muffin tends to eat
memory as it goes along and then just stops responding.

Basically, the muffin/htdig combo does not really work that well. I was
wondering if anybody knows of a better way to filter out javascript. If
so, could this be incorporated into htdig?

Any ideas?
