Re: [htdig] Searching Word Documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 15 Jul 1999 10:50:23 -0500 (CDT)


According to webmaster@chester.ac.uk:
>
>
> I've set up an external parser with htdig to search Word documents
> (parse_doc.pl.gz from the Contributed Word section) and set up the
> appropriate external_parsers: application/MSWord type. However

First of all, I think you need to specify application/msword all in
lower case, both in your server's mime.types file and in the
external_parsers attribute. Currently, htdig does a case-sensitive
lookup (in the Dictionary class) for these types, so the two have to
match exactly. I think this should be fixed in ExternalParser.cc so
that the comparison ignores case.
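
To illustrate why the case matters, here is a minimal Python sketch of
a case-sensitive table lookup next to a case-insensitive one. It is
only an analogy, not htdig's actual code: the table, the function names,
and the parser path are all hypothetical.

```python
# Hypothetical stand-in for the table built from external_parsers.
# The parser path is an assumption made up for this example.
PARSERS = {"application/msword": "/usr/local/bin/parse_doc.pl"}


def find_parser(content_type, table=PARSERS):
    """Case-sensitive lookup: the key must match character for character."""
    return table.get(content_type)


def find_parser_ignoring_case(content_type, table=PARSERS):
    """Lower-case both sides before comparing, so case no longer matters."""
    lowered = {key.lower(): value for key, value in table.items()}
    return lowered.get(content_type.strip().lower())


if __name__ == "__main__":
    # The server sends the mixed-case type; the table has the lower-case one.
    print(find_parser("application/MSWord"))
    print(find_parser_ignoring_case("application/MSWord"))
```

With the case-sensitive version, a server that reports
application/MSWord never matches an entry written as application/msword,
which is why both places need to use the same spelling.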

> htdig does seem to load these files now (it used to have unknown
> filetype error on these links), but when I'm using ./rundig
> (customised with my server url) I get the notification:
>
> "Deleted, no excerpt"
>
> Any clue, anyone? I need to parse these documents as they form
> part of my college's intranet site. Any help would be
> appreciated.

Somehow, parse_doc.pl isn't finding any text in the document. It uses
the catdoc tool to do this, so you should: a) make sure parse_doc.pl
is configured with the right path to the catdoc utility, b) try catdoc
directly on some of the Word documents that aren't being indexed, and c)
try parse_doc.pl directly on these documents to make sure it's outputting
the correct "h" and "w" records. The "h" record is the one that contains
the whole excerpt as one very long line.
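
Checks b) and c) can be scripted. The following Python sketch is one
possible way to do it, under some assumptions: the records are taken to
be tab-separated lines starting with a one-letter type code, catdoc is
assumed to be on the PATH, and the helper names are made up for this
example.

```python
import subprocess
from collections import Counter


def run_tool(argv):
    """Run a command and return whatever it wrote to standard output."""
    result = subprocess.run(argv, capture_output=True, text=True,
                            errors="replace")
    return result.stdout


def count_records(parser_output):
    """Count output records by their leading type code ("h", "w", ...)."""
    counts = Counter()
    for line in parser_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def excerpt_text(parser_output):
    """Return the text of the first "h" record, or "" if there is none."""
    for line in parser_output.splitlines():
        fields = line.split("\t", 1)
        if fields[0] == "h" and len(fields) == 2:
            return fields[1].strip()
    return ""
```

For check b), something like run_tool(["catdoc", "sample.doc"]) should
return readable text; sample.doc is a placeholder file name. For check
c), pass the parser's output to count_records() and excerpt_text(): no
"w" records or an empty excerpt would be consistent with the
"Deleted, no excerpt" message. The arguments parse_doc.pl expects are
not shown here, so check the script itself before running it by hand.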

I've found that catdoc has problems with some of the newer OLE-based Word
documents (especially with older versions of catdoc). If this is the
case, you may be out of luck unless you can find some other tool that
will extract the text from your documents.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


