Subject: Re: [htdig3-dev] Portable indexing
From: Rick Richardson (rick@digi.com)
Date: Sat Nov 06 1999 - 21:15:12 PST
Thanks to all who helped with suggestions.
Here is version 1 of my solution. The short shell script
that does all the work is an attachment.
-Rick
For a long time I've wanted to turn my vast email archives into HTML
and index *every single word* in those archives. Then, I wanted to
blast those archives onto CD-ROMs for permanent archival storage.
I have been eyeing a very nice, free indexing package called "htdig",
as the search engine of choice for this purpose (http://www.htdig.org).
It will index every word in a collection of text and HTML files.
The problem with htdig is that the conventional usage is to index an
entire web site on a single machine into a single database.
I wanted to index several collections independently, and wanted to be
able to easily move those collections and their indexes between
machines and onto CD-ROM without having to do a lot of work to
"install" the database onto each machine.
I worked out a shell script to do what I wanted to do. I have
attached said shell script "digdir".
As a proof of concept, I decided to use the latest copy of the
Internet RFC collection as a test.
I started with the 2700 or so RFCs in text form. I stored these in the
directory /home/httpd/html/rfc.
I then ran the "digdir" shell script like this:
$ cd /home/httpd/html
$ digdir rfc
After about 5 minutes, the shell script finished the indexing process.
It added a number of new files under /home/httpd/html/rfc, but did not
add or modify any other files on the computer. These new files
include a "search.html" search form used for submitting queries, and
the indexed database generated by htdig.
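For readers who do not have the attachment, here is a minimal sketch of the general idea. It is not the attached "digdir" script. It assumes the htdig 3.1-style "htdig" and "htmerge" commands with their -c (config file) and -i (initial dig) options, and the config attributes database_dir, start_url and limit_urls_to. The function names, the htdig.conf file name, the db subdirectory and the http://localhost base URL are all hypothetical. It also assumes the web server offers a page under the base URL that links to every file, and it does not generate the search.html form.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the attached "digdir" script.
# Idea: keep the htdig config and database inside the collection
# directory itself, so the whole directory can be copied elsewhere.

# write_conf DIR BASE_URL
# Writes DIR/htdig.conf with the database stored under DIR/db.
# File and directory names here are assumptions for illustration.
write_conf() {
    dir=$1
    base_url=$2
    abs=$(cd "$dir" && pwd) || return 1
    mkdir -p "$abs/db" || return 1
    cat > "$abs/htdig.conf" <<EOF
# Generated per-collection configuration (illustrative values).
database_dir:   $abs/db
start_url:      $base_url/
limit_urls_to:  $base_url/
EOF
}

# index_dir DIR BASE_URL
# Writes the config, then runs the indexer only if it is installed.
index_dir() {
    write_conf "$1" "$2" || return 1
    conf=$(cd "$1" && pwd)/htdig.conf
    if command -v htdig >/dev/null 2>&1 &&
       command -v htmerge >/dev/null 2>&1; then
        # -i starts from an empty database; -c selects our config.
        htdig -i -c "$conf" && htmerge -c "$conf"
    else
        echo "htdig not found; wrote $conf only" >&2
    fi
}
```

With these functions loaded, a run comparable to the one above would be "cd /home/httpd/html" followed by "index_dir rfc http://localhost/rfc". Because database_dir is written as an absolute path, a copy of the directory only works when it is placed at the same path on the other machine.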
It is now possible to blast this entire directory onto CD-R, and you
could mount this CD-ROM on another machine under /home/httpd/html and
it would work (assuming you have previously installed the stock htdig
RPM package).
To see the results, open this URL (will work only on the Digi intranet):
http://digifax.digi.com/rfc/search.html
In the search form, type "url" or anything else you'd like to search for.
Enjoy.
-Rick
--
Rick Richardson  rick@digi.com  http://RickRichardson.freeservers.com/
My current CI is 28. I'm 41. I need 14 more cylinders by my next
birthday. Two PWC's and an SUV ought to do it. That's my new goal.