htdig: MS Office files -- help indexing them, please!


Tyson Bigler (bigler@shellus.com)
Tue, 24 Nov 1998 10:40:48 -0600 (CST)


I'm setting up an export control scan to comply with federal export control
laws. My peers at other business units are all using (well, attempting to
use) AltaVista, but I have chosen ht://Dig (and a good dose of Perl ;-D), and
I have gotten *much* further *much* faster with it. I have been able to parse
MS Word and MS Excel files with ht://Dig, but I am also required to look at
other MS Office documents (e.g. PowerPoint). Does anyone have an external
parser for these? My peers keep telling me that AltaVista has all of these
"filters" (aka parsers), but I haven't seen or used them...

I am also having difficulty with htmerge on a fairly large (and it will only
grow larger) index. The specific error seems to be coming from the sort
command. When using the standard sort included with Solaris 2.5.1 I get:

# htmerge -c conf/unix.conf -v -s
htmerge: Sorting...
sort: can't create /home/atlantis8/bigler/stmAAAa00598/a: Not a directory
htmerge: Word sort failed

and when using the GNU sort included with textutils-1.22 I get:

# htmerge -c conf/unix.conf -v -s
htmerge: Sorting...
/home/atlantis3/bigler/opt/bin/sort: read error: Invalid argument
htmerge: Word sort failed
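
A minimal sketch for narrowing this down, assuming the directory sort uses for
its temporary files is involved (an assumption; the messages above do not
confirm it). GNU sort takes that directory from TMPDIR or its -T option;
whether the Solaris sort and htmerge pick it up the same way is also an
assumption. check_sort_tmpdir is a hypothetical helper, not part of ht://Dig:

```shell
#!/bin/sh
# Hypothetical helper (not part of ht://Dig): report whether a directory
# can be used as scratch space for sort's temporary files.
# Argument: directory to check; defaults to $TMPDIR, then /tmp.
check_sort_tmpdir() {
    dir=${1:-${TMPDIR:-/tmp}}
    if [ ! -d "$dir" ]; then
        # Missing, or a plain file sitting where a directory is expected.
        echo "not a directory: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        # Directory exists but the current user cannot create files in it.
        echo "not writable: $dir"
        return 1
    fi
    echo "ok: $dir"
    return 0
}

# Check whatever directory the current environment points at.
check_sort_tmpdir
```

If the directory checks out, free space on that filesystem (df -k) would be
the next thing to look at, since the word list being sorted is large.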

Any help would be *greatly* appreciated. I would rather not go the other
direction and be forced into AltaVista.... ;-D And I'd like to deliver a
solution way ahead of the "other guy". ;-D

Thanks,

Tyson

---
M. Tyson Bigler                  SEPTCo Computing Solutions Group
Infrastructure Support           Bellaire Technology Center
bigler@shellus.com               3737 Bellaire Blvd., Room 1007B
    713-245-7476                 Houston, TX 77025



