Re: htdig: htdig, muffin and javascript

Mark Boyns (
17 Sep 1998 09:15:53 -0700

> About three weeks ago I experimented with muffin to filter out javascript
> from the search indexes (it did not work that well). Here is my experience:
> It is known that htdig does not parse javascript properly. The search
> summaries display the ugly javascript code instead of the document's true
> summary. It has been suggested that muffin be used to filter out the
> javascript junk.
> I have installed muffin ( and it does filter out the
> javascript, but from what I have found, muffin is more of a personal proxy
> and does not work under a high load. Muffin tends to return incorrect info
> such as the wrong URL, or the wrong page data when many requests are made
> to it.
> To get around this, I added a sleep statement in the document retriever
> loop ( so it would be forced to wait 1 second between
> requests. Although very slow, the muffin htdig combo worked until I
> indexed a large site.
> After indexing 40,000+ documents, muffin gave up. Muffin tends to eat
> memory as it goes along and then just stops responding.
> Basically, the muffin/htdig combo does not really work that well. I was
> wondering if anybody knows of a better way to filter out javascript. If
> so, could this be incorporated into htdig?

Andrew forwarded me this message a while back so I guess I should
respond. With regard to Muffin eating up a lot of memory, did you
try running Muffin without the GUI? You can do this at startup with
the -nw option.

I'm not sure why Muffin would return incorrect results. Is there
any way you can reproduce this?

A new version of Muffin will be released next week sometime. This
version does have at least one HTML parsing fix.
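
On the question of filtering out javascript without a proxy in the
loop: the following is only a rough sketch of the general idea, and it
is not how Muffin or htdig do it. It assumes the pages can be passed
through a small standalone filter before they are indexed. The class
ScriptStripper and the function strip_scripts are hypothetical names
made up for this illustration; the only library used is Python's
standard html.parser module.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML, dropping <script> elements and their contents.

    Hypothetical helper for illustration; not part of Muffin or htdig.
    """

    def __init__(self):
        # Keep entity and character references as separate events so
        # they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            # Original tag text, with attributes and case preserved.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing script is dropped.
        if tag != "script" and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            # The parser reports end tags in lowercase.
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Script bodies arrive here as plain data and are skipped.
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.in_script:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html):
    """Return html with every <script>...</script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Something like strip_scripts(page_text) would be called on each
retrieved document before it is handed to the indexer. Since it works
on one document at a time and keeps no state between calls, it should
not grow in memory over a long indexing run, but I have not tried it
against a large site.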
