Re: [htdig] Word documents indexing problem


Subject: Re: [htdig] Word documents indexing problem
From: Jean-Francois Le Carre Petit (jean-francois_le-carre-petit@hp-france-om18.om.hp.com)
Date: Wed Jun 07 2000 - 05:20:50 PDT


D.J.Adams@soton.ac.uk wrote:
>
> >
> > Hello,
> >
> > I use htdig 3.1.5 on Red Hat Linux 6.1.
> >
> > I have configured the htdig.conf file as follows:
> >
> > valid_extensions: .html .htm .doc .pdf .txt
> > local_default_doc: new_index.html index.html index.htm main.htm main_frame.htm frame.htm content.htm title.htm main2.htm
> >
> > local_urls_only: true
> >
> > local_urls: http://gnbuxsl.grenoble.hp.com:8090/=/var/opt/web/
> >
> > #
> > # Since ht://Dig does not (and cannot) parse every document type, this
> > # attribute is a list of strings (extensions) that will be ignored during
> > # indexing. These are *only* checked at the end of a URL, whereas
> > # exclude_url patterns are matched anywhere.
> > #
> > bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
> >         .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
> >
> > max_doc_size: 20000000
> >
> > external_parsers: application/msword->text/html /usr/local/bin/parse_doc.pl \
> >         application/postscript->text/html /usr/local/bin/parse_doc.pl \
> >         application/pdf->text/html /usr/local/bin/parse_doc.pl
> >
> > Indexing PDF files works fine, whereas I get the following message when
> > indexing MS Word files:
> >
> > 30:30:2:http://gnbuxsl.grenoble.hp.com:8090/doc/tech/casc/details_casc.doc: Trying local files
> > found existing file /var/opt/web/doc/tech/casc/details_casc.doc
> > not found
> >
> > The file /var/opt/web/doc/tech/casc/details_casc.doc actually exists...
> >
> > I don't understand what the problem can be. Running rundig with several
> > additional -v options does not help.
> >
> > Could somebody help me?
> >
> > Thanks,
> > Jean-Francois.
> > --
>
> I think the "not found" could refer to the utility which you are using
> within parse_doc.pl to handle word documents.
>
> Try calling parse_doc.pl from the command line:
>
> parse_doc.pl /var/opt/web/doc/tech/casc/details_casc.doc arg2 arg3
>
> and see what happens.
>
> --
>
> David J Adams
> <D.J.Adams@soton.ac.uk>
> Computing Services
> University of Southampton

Hello,

It works. I use the same script for Acrobat files, and it works properly.

It uses catdoc, which is located in /usr/local/bin:

ll /usr/local/bin/catdoc
-rwxr-xr-x 1 root root 55235 May 19 14:12 /usr/local/bin/catdoc
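
The same kind of check can be written as a small shell sketch for any helper that parse_doc.pl might call. This is only an illustration: check_tool is a hypothetical function name, not part of ht://Dig or parse_doc.pl, and it tests nothing beyond whether the path is a regular file that the current user may execute.

```shell
# check_tool is a hypothetical helper, not part of ht://Dig.
# It prints one status line for the given path and returns:
#   0 if the path is a regular file and executable by the current user
#   1 if the path is not a regular file (or does not exist)
#   2 if the file exists but is not executable by the current user
check_tool() {
    if [ ! -f "$1" ]; then
        echo "missing: $1"
        return 1
    fi
    if [ ! -x "$1" ]; then
        echo "not executable: $1"
        return 2
    fi
    echo "ok: $1"
}
```

It could be called as "check_tool /usr/local/bin/catdoc" or "check_tool /usr/local/bin/parse_doc.pl", preferably as the same user that runs rundig, since permissions can differ between accounts.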

I think the problem is somewhere else.

Thanks,
Jean-Francois.

--




This archive was generated by hypermail 2b28 : Wed Jun 07 2000 - 04:16:43 PDT