Re: htdig: htdig, muffin and javascript


Mark Boyns (boyns@sdsu.edu)
17 Sep 1998 09:15:53 -0700


> About three weeks ago I experimented with muffin to filter out javascript
> from the search indexes (it did not work that well). Here is my experience:
>
> It is known that htdig does not parse javascript properly. The search
> summaries display the ugly javascript code instead of the document's true
> summary. It has been suggested that muffin be used to filter out the
> javascript junk.
>
> I have installed muffin (http://muffin.doit.org) and it does filter out the
> javascript, but from what I have found, muffin is more of a personal proxy
> and does not work under a high load. Muffin tends to return incorrect info
> such as the wrong URL, or the wrong page data when many requests are made
> to it.
>
> To get around this, I added a sleep statement in the document retriever
> loop (Retriever.cc) so it would be forced to wait 1 second between
> requests. Although very slow, the muffin/htdig combo worked until I
> indexed a large site.
>
> After indexing 40,000+ documents, muffin gave up. Muffin tends to eat
> memory as it goes along and then just stops responding.
>
> Basically, the muffin/htdig combo does not really work that well. I was
> wondering if anybody knows of a better way to filter out javascript. If
> so, could this be incorporated into htdig?
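
For illustration only, here is a rough sketch of what filtering javascript
before indexing might look like. It is not taken from htdig or Muffin; the
function name strip_scripts is hypothetical, and it assumes that simply
dropping everything between <script ...> and </script> is good enough for
building search summaries.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of htdig or Muffin): returns a copy of
// `html` with every <script ...>...</script> element removed. Tag names
// are matched case-insensitively.
std::string strip_scripts(const std::string& html) {
    // Lower-cased copy used only for searching; offsets match `html`.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    std::string out;
    std::string::size_type pos = 0;
    while (true) {
        std::string::size_type open = lower.find("<script", pos);
        if (open == std::string::npos) {
            // No more script elements: keep the rest of the document.
            out.append(html, pos, std::string::npos);
            break;
        }
        // Keep the text that precedes the script element.
        out.append(html, pos, open - pos);

        std::string::size_type close = lower.find("</script", open);
        if (close == std::string::npos) break;  // unterminated: drop the rest
        std::string::size_type end = lower.find('>', close);
        if (end == std::string::npos) break;    // malformed close tag: drop the rest
        pos = end + 1;
    }
    return out;
}
```

A real parser would also have to deal with event-handler attributes and
javascript: URLs, which this sketch ignores.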

Andrew forwarded me this message a while back, so I guess I should
respond. With regard to Muffin eating up a lot of memory, did you
try running Muffin without the GUI? You can do this at startup with
the -nw option.

I'm not sure why Muffin would return incorrect results. Is there
any way you can reproduce this?

A new version of Muffin will be released next week sometime. This
version does have at least one HTML parsing fix.