Re: [htdig] using perl/cron to find badwords on site


Subject: Re: [htdig] using perl/cron to find badwords on site
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Jan 11 2001 - 07:15:28 PST


According to Jerry Preeper:
> I don't know if anyone else has run across this yet, but I have a number of
> guestbooks and things like that where people can post and I would love to
> be able to find a way to set up a daily cron job with perl script that
> basically runs a set of badwords through htsearch and then emails me a list
> of just the urls it finds with those words in it... I don't really need
> things like the page title or description or stuff like that.. I'm
> assuming I'll need to use a system call in the script to some sort of
> command line option and loop it for each word... Any input would be
> greatly appreciated.

I assume that you want your htdig database updated through this same
cron job, before running htsearch, so that the database you search will
contain any new postings to the guestbooks. The simplest way I can
think of, assuming the correct settings are already made in htdig.conf,
would be a shell script with these commands...

  htdig
  htmerge
  /path/to/cgi-bin/htsearch "words=badword1+badword2+badword3+badword4"
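
If the word list lives in a file rather than in the script itself, the query string could be assembled along these lines. This is only a sketch: the function name build_query and the word-file path are made up for illustration, and it assumes one plain word per line with nothing that would need URL-encoding.

```shell
#!/bin/sh
# Build an htsearch query string such as "words=w1+w2+w3" from a file
# that lists one word per line. build_query is a hypothetical helper,
# not part of ht://Dig.
build_query() {
  # Drop blank lines, then join the remaining lines with "+".
  words=$(grep -v '^[[:space:]]*$' "$1" | paste -s -d '+' -)
  printf 'words=%s' "$words"
}
```

It could then be used as /path/to/cgi-bin/htsearch "$(build_query /path/to/bad-word-file)" in place of the hard-coded query above.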

Of course, if you want to write it in Perl, especially if you need more
processing than simply running these programs, you can call the above
commands in one or more calls to the system("...") function in Perl.

You may want to customise the htsearch templates to get just the URL,
if that's all you want (see template_map, search_results_header and
search_results_footer in http://www.htdig.org/attrs.html). If you want
to search for each word separately, rather than one query for all words,
then you'd need to call htsearch once for each individual word. E.g. in
a shell script, you could do:

  htdig; htmerge
  for word in badword1 badword2 badword3 badword4
  do
    echo "${word}:"
    /path/to/cgi-bin/htsearch "words=${word}"
  done

or:

  htdig; htmerge
  # read one word per line; -r keeps backslashes literal
  while IFS= read -r word
  do
    echo "${word}:"
    /path/to/cgi-bin/htsearch "words=${word}"
  done < /path/to/bad-word-file

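However, it seems to me it would be better to search for all of them at
once, unless you need a word-by-word summary of URLs. For a combined
query to match pages containing any one of the words, the match method
needs to be "or" (see match_method in attrs.html); with "and", only pages
containing every word are returned.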
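
If you would rather not touch the templates, a rough alternative is to filter the htsearch output down to the URLs it contains before mailing it. The sketch below assumes the output is HTML containing absolute http or https links; extract_urls is a hypothetical helper name, and grep -o is a common extension rather than strict POSIX.

```shell
#!/bin/sh
# Reduce htsearch output (read on stdin) to a sorted list of unique URLs.
# extract_urls is a hypothetical helper, not part of ht://Dig.
extract_urls() {
  # Print every http:// or https:// URL found, one per line, then
  # remove duplicates.
  grep -Eo 'https?://[^"<> ]+' | sort -u
}
```

The result could be piped to something like mail -s "badword report" you@example.com (the address and subject are placeholders). Note that the stock templates also contain links that are not search results, such as the ht://Dig logo link, so those would show up in the list too; a URL-only template avoids that.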

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



