Re: [htdig3-dev] Portable indexing


Subject: Re: [htdig3-dev] Portable indexing
From: Rick Richardson (rick@digi.com)
Date: Sat Nov 06 1999 - 21:15:12 PST


Thanks to all who helped with suggestions.

Here is version 1 of my solution. The short shell script
that does all the work is an attachment.

-Rick

For a long time I've wanted to turn my vast email archives into HTML
and index *every single word* in those archives. Then, I wanted to
blast those archives onto CD-ROMs for permanent archival storage.

I have been eyeing a very nice, free indexing package called "htdig",
as the search engine of choice for this purpose (http://www.htdig.org).
It will index every word in a collection of text and HTML files.

The problem with htdig is that the conventional usage is to index an
entire web site on a single machine into a single database.

I wanted to index several collections independently, and wanted to be
able to easily move those collections and their indexes between
machines and onto CD-ROM without having to do a lot of work to
"install" the database onto each machine.

I worked out a shell script to do what I wanted to do. I have
attached said shell script "digdir".

As a proof of concept, I decided to use the latest copy of the
Internet RFC collection as a test.

I started with the 2700 or so RFCs in text form. I stored these in the
directory /home/httpd/html/rfc.

I then ran the "digdir" shell script as follows:

        $ cd /home/httpd/html
        $ digdir rfc

After about 5 minutes, the shell script finished the indexing process.
It added a number of new files under /home/httpd/html/rfc, but did not
add or modify any other files on the computer. These new files
include a "search.html" search form used for submitting queries, and
the indexed database generated by htdig.
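
For readers without the attachment, here is a rough sketch of the general
idea: keep the htdig configuration, the database, and the search form all
inside the collection directory. This is NOT the attached "digdir" script.
The function names, the file names "htdig.conf" and "db", and the example
URL are my assumptions for illustration. It assumes htdig 3.1.x (the
"htdig" and "htmerge" commands) and an htsearch CGI at /cgi-bin/htsearch.

```shell
#!/bin/sh
# Hypothetical sketch only; this is NOT the attached "digdir" script.
# File names (htdig.conf, db) and the CGI path are illustrative assumptions.

# write_conf DIR BASE_URL: create DIR/db and a per-collection DIR/htdig.conf.
write_conf() {
    abs=$(cd "$1" && pwd) || return 1
    mkdir -p "$abs/db" || return 1
    # \${start_url} is written literally; htdig expands it when it reads
    # the configuration file.
    cat > "$abs/htdig.conf" <<EOF
database_dir:   $abs/db
start_url:      $2
limit_urls_to:  \${start_url}
EOF
}

# write_form DIR: create DIR/search.html, a form that submits to htsearch.
write_form() {
    abs=$(cd "$1" && pwd) || return 1
    name=$(basename "$abs")
    cat > "$abs/search.html" <<EOF
<html><head><title>Search $name</title></head><body>
<form method="get" action="/cgi-bin/htsearch">
<input type="hidden" name="config" value="$name">
<input type="text" name="words" size="40">
<input type="submit" value="Search">
</form>
</body></html>
EOF
}

# build_index DIR: dig from scratch (-i), then merge, using DIR/htdig.conf.
build_index() {
    abs=$(cd "$1" && pwd) || return 1
    htdig -i -c "$abs/htdig.conf" && htmerge -c "$abs/htdig.conf"
}
```

With these functions loaded, a run over the RFC directory might look like:

        $ cd /home/httpd/html
        $ write_conf rfc http://localhost/rfc/ && write_form rfc
        $ build_index rfc

Note that in this sketch database_dir is an absolute path, so the
directory has to appear at the same location on every machine. Also,
htsearch normally resolves the "config" value to a file in its own
configuration directory; how the real script handles that is not shown
here.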

It is now possible to blast this entire directory onto CD-R, and you
could mount this CD-ROM on another machine under /home/httpd/html and
it would work (assuming you have previously installed the stock htdig
RPM package).

To see the results, open this URL (will work only on the Digi intranet):

        http://digifax.digi.com/rfc/search.html

In the search form, type "url" or anything else you'd like to search for.

Enjoy.

-Rick

-- 
Rick Richardson  rick@digi.com   http://RickRichardson.freeservers.com/

My current CI is 28. I'm 41. I need 14 more cylinders by my next birthday. Two PWCs and an SUV ought to do it. That's my new goal.





This archive was generated by hypermail 2b25 : Sat Nov 06 1999 - 21:25:27 PST