Re: [htdig] Searching Word Documents

Gilles Detillieux
Thu, 15 Jul 1999 10:50:23 -0500 (CDT)

According to
> I've set up an external parser with htdig to search Word documents
> ( from the Contributed Word section) and setup
> appropriate external_parsers: application/MSWord type. However

First of all, I think you need to specify application/msword all in
lower case, both in your server's mime.types file, and in the
external_parsers attribute. Currently, htdig does a case-sensitive
lookup (in the Dictionary class) for these types, so they have to
match exactly. I think maybe this needs to be fixed in
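
To illustrate why the case matters, here is a minimal Python sketch of
an exact-match lookup. It is not htdig's code (htdig is written in C++);
the table, the parser path, and both function names are hypothetical and
exist only to show the behaviour described above.

```python
# Illustration only: mimics an exact, case-sensitive table lookup.
# The parser path is a placeholder, not a real htdig default.
parsers = {"application/msword": "/path/to/word-parser"}

def find_parser(content_type, table=parsers):
    """Exact lookup: returns None unless the key matches character for character."""
    return table.get(content_type)

def find_parser_normalized(content_type, table=parsers):
    """Same lookup, but lowercases the type first so case differences are ignored."""
    return table.get(content_type.strip().lower())

print(find_parser("application/MSWord"))             # no match: case differs
print(find_parser("application/msword"))             # matches the table entry
print(find_parser_normalized("application/MSWord"))  # matches after lowercasing
```

With an exact lookup like the first function, the only fix on the
configuration side is to spell the type identically everywhere.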

> htdig does seem to load these files now (it used to have unknown
> filetype error on these links), but when I'm using ./rundig
> (customised with my server url) I get the notification:
> "Deleted, no excerpt"
> Any clue anyone. I need to parse these documents as they form
> part of my college's intranet site. Any help would be
> appreciated.

Somehow, the external parser isn't finding any text in the document. It
uses the catdoc tool to do this, so you should: a) make sure the parser
script is configured with the right path to the catdoc utility, b) try
catdoc directly on some of the Word documents that aren't being indexed,
and c) try the parser script directly on these documents to make sure
it's outputting the correct "h" and "w" records. The "h" record is the
one that contains the whole excerpt as one very long line.
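
Checks b) and c) can be scripted. The following is a rough Python
sketch, not part of htdig: it assumes catdoc is on your PATH and prints
the extracted text to standard output, and it assumes the parser emits
one record per line with the record type ("w", "h", ...) as the first
tab-separated field. Both function names are made up for this example.

```python
import subprocess

def catdoc_text(doc_path, catdoc="catdoc"):
    """Run catdoc on one file and return whatever text it wrote to stdout."""
    result = subprocess.run([catdoc, doc_path], capture_output=True,
                            text=True, errors="replace")
    return result.stdout

def summarize_records(parser_output):
    """Count "w" and "h" records and report the longest "h" excerpt length.

    Assumes tab-separated records with the record type in the first field.
    """
    counts = {"w": 0, "h": 0}
    excerpt_len = 0
    for line in parser_output.splitlines():
        kind, _, rest = line.partition("\t")
        if kind in counts:
            counts[kind] += 1
        if kind == "h":
            excerpt_len = max(excerpt_len, len(rest.strip()))
    return {"w": counts["w"], "h": counts["h"], "excerpt_len": excerpt_len}
```

If catdoc_text() returns only whitespace for a document, or the summary
shows no "h" record (or an excerpt length of zero), that document will
most likely end up as "Deleted, no excerpt".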

I've found that catdoc has problems with some of the newer OLE based Word
documents (especially the older version of catdoc). If this is the case,
you may be out of luck unless you can find some tool that will extract
the text from your documents.
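
To see which of your files are OLE-based at all, you can check the first
eight bytes for the OLE compound-file signature. This is only a sketch
with made-up function names; a match tells you the container format, not
whether your catdoc version can read it.

```python
# Signature found at the start of OLE compound files.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

def is_ole_header(data):
    """True if the given bytes begin with the OLE compound-file signature."""
    return data[:len(OLE_MAGIC)] == OLE_MAGIC

def is_ole_file(path):
    """Read just enough of the file to test its signature."""
    with open(path, "rb") as f:
        return is_ole_header(f.read(len(OLE_MAGIC)))
```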

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

This archive was generated by hypermail 2.0b3 on Thu Jul 15 1999 - 08:07:33 PDT