[htdig] Questions about what's possible with ht://Dig...


Albert Lunde (Albert-Lunde@nwu.edu)
Tue, 6 Jul 1999 13:51:17 -0500


I'm looking at several freeware software packages to see which would be most
useful for campus-wide indexing at our university.

(We'd originally planned to use the commercial OpenText software, but
current versions of that software seem to be too tightly integrated with
their document management system.)

What we'd like to do is remotely spider a number of small servers in
rotation (say, once every week or two), while indexing some larger servers
through the file system (say, nightly), then somehow run a single query
that searches all those indexes.

We _don't_ want to mirror all the HTML for all the servers, all the time,
just store the indexes. (Total disk space is a limiting resource.)

From what I've read so far, ht://Dig seems like a pretty flexible spider,
which could be configured to spider remote systems or to access the local
server directly.

It sounds like http://www.htdig.org/files/contrib/scripts/multidig.tar.gz
might be useful for running a series of indexes on various servers.

I have a few questions about what is possible:

(1) Is the only way to deal with queries across multiple indexes to combine
the indexes with htmerge, or is there a way to query more than one index
and aggregate the results?

(2) Can your data files be copied between systems (e.g. doing local
indexing on one server, then copying with ftp or scp to another server for
merging or searching)? I can think of several sorts of issues:
  - absolute path names
  - byte order or floating-point formats across architectures

(In our environment, most of our Unix web servers run HP-UX, so processor
architecture isn't a big problem, but I'd like to know if it's an issue,
for future reference, anyway.)

(3) Is there a way to index all the HTML files in a directory tree,
regardless of how they are linked (or some other arbitrary list of files
on the local system)?
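On that last point, one approach I could imagine (a sketch, not something
from the ht://Dig documentation) is generating the start-URL list myself
from the directory tree, so every HTML file gets listed whether or not
anything links to it. The doc root and hostname below are placeholders:

```shell
# Sketch only: map every HTML file under a doc root to its URL, so an
# index run can cover files regardless of how (or whether) they're linked.
# The doc root and hostname in the usage example are invented.
make_url_list() {
    docroot=$1
    baseurl=$2
    # Find each .html/.htm file and rewrite its path prefix into a URL.
    find "$docroot" \( -name '*.html' -o -name '*.htm' \) -type f \
        | sed "s|^$docroot|$baseurl|"
}

# Example usage (hypothetical paths):
# make_url_list /var/www/htdocs http://www.example.edu > /tmp/start_urls.txt
```

I believe ht://Dig's configuration can read an attribute value from a file
when the value is wrapped in backquotes, so a list like this could perhaps
be fed in as start_url, but I'd want to check the attribute documentation.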

(4) Is it feasible to use the ht://Dig spider with different search and
indexing software?

I guess the last two questions depend on the interface between the spider
and the indexing software: to what extent it is exposed in a form that
external software could hook into, or to what extent the whole package is
too interconnected to pick apart.

Other packages I'm looking at with the same concerns in mind are:

SWISH-E:
http://sunsite.berkeley.edu/SWISH-E/

The revived "Harvest" software from:
http://www.tardis.ed.ac.uk/harvest/

"Combine" together with "Zebra":
http://www.lub.lu.se/combine/
http://www.indexdata.dk/zebra/

If you'd care to comment on the pros or cons of any of this, I'd be interested.

Direct replies to me or the list, as you think appropriate.

---
    Albert Lunde                      Albert-Lunde@nwu.edu
------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.



This archive was generated by hypermail 2.0b3 on Tue Jul 06 1999 - 11:09:12 PDT