Re: [htdig] Word documents indexing problem


Subject: Re: [htdig] Word documents indexing problem
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Jun 07 2000 - 09:15:38 PDT


According to Jean-Francois Le Carre Petit:
> I use htdig 3.1.5 on linux Redhat 6.1.
>
> I have configured htdig.conf file as follows :
>
> valid_extensions: .html .htm .doc .pdf .txt
> local_default_doc: new_index.html index.html index.htm main.htm main_frame.htm frame.htm content.htm title.htm main2.htm
>
> local_urls_only: true
>
> local_urls: http://gnbuxsl.grenoble.hp.com:8090/=/var/opt/web/
>
> #
> # Since ht://Dig does not (and cannot) parse every document type, this
> # attribute is a list of strings (extensions) that will be ignored during
> # indexing. These are *only* checked at the end of a URL, whereas
> # exclude_url patterns are matched anywhere.
> #
> bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
>     .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
>
> max_doc_size: 20000000
>
> external_parsers: application/msword->text/html /usr/local/bin/parse_doc.pl \
>     application/postscript->text/html /usr/local/bin/parse_doc.pl \
>     application/pdf->text/html /usr/local/bin/parse_doc.pl
>
> pdf files indexing works fine

It may seem to work fine, but you'll be getting extra garbage in
your excerpts, and the weights of words in the database will be
thrown off because the words are duplicated. The problem is that the
"->text/html" in each entry tells htdig to treat parse_doc.pl as an
external converter, not an external parser. parse_doc.pl does not
output HTML, but that's what you're telling htdig to expect. You really
should be using conv_doc.pl or doc2html.pl as an external converter,
and toss parse_doc.pl aside.

Take a closer look at http://www.htdig.org/attrs.html#external_parsers and
http://www.htdig.org/FAQ.html#q4.9 for the distinction between the two.
It's a bit confusing because both converters and parsers are specified
via the external_parsers attribute, but they function rather differently.
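
As a rough sketch of the converter form, assuming conv_doc.pl has been
copied to /usr/local/bin (that path is only an assumption, so adjust it
to wherever you install the script), the attribute would look something
like this:

```
# Converter entries use "source-type->target-type" followed by the program.
# The /usr/local/bin path is an assumed install location.
external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
    application/postscript->text/html /usr/local/bin/conv_doc.pl \
    application/pdf->text/html /usr/local/bin/conv_doc.pl
```

An entry for a true external parser has no "->" part, just the content
type followed by the program.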

> whereas I get the following message when
> indexing msword files :
>
> 30:30:2:http://gnbuxsl.grenoble.hp.com:8090/doc/tech/casc/details_casc.doc:
> Trying local files
> found existing file /var/opt/web/doc/tech/casc/details_casc.doc
> not found
>
> The file /var/opt/web/doc/tech/casc/details_casc.doc actually exists...
>
> I don't understand what the problem can be. Running rundig with several
> additional -v options does not help.
>
> Could somebody help me ?

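You can't index .doc files locally, at least not without modifying the
Document::RetrieveLocal() code in htdig/Document.cc. With local_urls_only
set to true, htdig won't fall back to HTTP for them, which would explain
the "not found".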

Take a closer look at http://www.htdig.org/attrs.html#local_urls to see
what file name extensions it handles.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
