Re: [htdig] Advice wanted: Multiple mailing lists


Subject: Re: [htdig] Advice wanted: Multiple mailing lists
From: Wayde Allen (wallen@lug.boulder.co.us)
Date: Fri Mar 17 2000 - 08:20:40 PST


On Thu, 16 Mar 2000, Geoff Hutchison wrote:

> On Thu, 16 Mar 2000, David Gibbs wrote:
>
> > My question is: What is the best way to build my search indexes? Should I
> > have one large database with a search filter restriction, or should I have
> > multiple databases (one for each mailing list archive)?
>
> This depends considerably on how big these archives are going to be. On
> the WSO site, we just have one big database (now around 80,000 URLs) with
> mailing lists, student pages, etc. Anyone who wants to narrow a search uses
> the restrict and exclude fields on the search form.
>
> I know of at least two mailing list archive sites that have multiple
> databases. But these folks index hundreds of high volume (e.g.
> linux-kernel and bugtraq) mailing lists.
>
> So my suggestion depends on the volume you expect to receive. If you think
> you might have multi-GB of data combined, you probably want to split them.
> It also depends a bit on whether users might want to search all of them at
> once!

In my mailing list archives the messages are archived by thread, author,
subject, date, and also stored as a large text file (month.txt). I've
tried setting the exclude_urls line in the htdig.conf file to:

exclude_urls: /cgi-bin/ .cgi subject.html author.html \
                .txt

This sort of seems to work, but I was wondering if anyone has a better
solution? I don't like excluding anything with a .txt extension for
instance, but also don't want to have to add an exclusion line explicitly
for each month.txt file.
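
To make the concern concrete, here is a rough Python sketch (not ht://Dig
code) of how I understand the exclusion to behave, assuming exclude_urls
rejects any URL that contains one of the space-separated patterns as a
substring. The function name is_excluded and the example.org URLs are made
up for illustration.

```python
# Hypothetical sketch, not ht://Dig code: assumes exclude_urls rejects any
# URL that contains one of the space-separated patterns as a substring.
EXCLUDE_PATTERNS = ["/cgi-bin/", ".cgi", "subject.html", "author.html", ".txt"]


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if any pattern occurs anywhere in the URL."""
    return any(pattern in url for pattern in patterns)


# Made-up URLs for illustration only.
urls = [
    "http://example.org/archive/2000-03/0042.html",
    "http://example.org/archive/2000-03/subject.html",
    "http://example.org/archive/2000-03/month.txt",
    "http://example.org/docs/readme.txt",
]

for url in urls:
    print(url, "->", "excluded" if is_excluded(url) else "indexed")
```

Under that assumption the last URL is the side effect I'd like to avoid: an
unrelated readme.txt gets dropped along with the monthly archive files.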

I've also seen several references about creating files of URLs rather than
creating the long continued lines in the config file. What is the syntax
for this?

- Wayde
  (wallen@boulder.nist.gov)


This archive was generated by hypermail 2b28 : Fri Mar 17 2000 - 07:20:11 PST