[htdig] Question about parsing word, pdf, ppt etc.

Subject: [htdig] Question about parsing word, pdf, ppt etc.
From: Aditya Shah (adshah@coreon.net)
Date: Mon Dec 18 2000 - 18:28:43 PST


We are evaluating the use of ht://Dig for an intranet site. Our users publish
a lot of Word, Excel, PowerPoint and PDF documents that we want to be able to
search through.

We have been able to get all the external parsers required. We have run into
the following issues:

1) Unable to parse PowerPoint documents. The documents are MS PowerPoint
2000 documents. We got the ppt2html parser from www.xlHtml.org. The
statements in htdig.conf are something like this:

                    application/msexcel->text/html /app/doc2html/doc2html.pl \
/app/doc2html/doc2html.pl \
/app/doc2html/doc2html.pl \

Excel works great, but for PowerPoint, the 'rundig' program just hangs when
I run it and never finishes.
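
One way to narrow down a hang like this might be to put a small wrapper with
a timeout in front of the converter, so that a stuck conversion is killed and
the indexing run can continue. The following is only a sketch under stated
assumptions: Python 3 is available on the indexing host, the wrapper would be
listed in htdig.conf in place of the real converter, and CONVERTER is a
hypothetical placeholder path, not the actual location of ppt2html. The
wrapper forwards whatever arguments it receives unchanged.

```python
import subprocess
import sys

# Hypothetical placeholder: point this at the real converter script.
CONVERTER = "/path/to/converter"


def run_converter(cmd, timeout_seconds=30):
    """Run cmd and return (ok, output_text).

    ok is False if the command timed out, could not be started,
    or exited with a non-zero status.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, ""
    except OSError:
        return False, ""
    if result.returncode != 0:
        return False, ""
    return True, result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the received arguments straight through to the converter.
    ok, text = run_converter([CONVERTER] + sys.argv[1:])
    if ok:
        sys.stdout.write(text)
    sys.exit(0 if ok else 1)
```

If the wrapper returns quickly with a failure for PowerPoint files only, that
would suggest the converter itself is hanging rather than rundig.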

2) Getting gibberish in the headers for some Word and PDF documents. For
example, for a Word document:

In doc 2 html ; ; ; ; ; ; ; ; Fax Fax Please Recycle Comments: `"?
gP?]...u-OwP?+`?0|?( ?UY{O?r-| ?]* ! ^mB?t
5?+Hc-#*g"C?,m?Pss (_~$+-V S??_+yw?<?-? ?\...Y ...

when the search results are returned. This does not happen for all Word
documents, only for some of them.

And for a PDF document, we always get the ' ' character before any file
name in the search results section.
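
To find out which documents produce gibberish like the Word example above,
one option might be to run each converter by hand and check how much of its
output is printable text. The sketch below is a rough heuristic only, and it
assumes the documents are mostly plain ASCII English; the function names and
the 0.85 threshold are illustrative choices, not part of any parser.

```python
import string

# Characters treated as normal text (ASCII letters, digits,
# punctuation and whitespace).
PRINTABLE = set(string.printable)


def printable_ratio(text):
    """Return the fraction of characters in text that are printable ASCII."""
    if not text:
        return 1.0
    return sum(ch in PRINTABLE for ch in text) / len(text)


def looks_like_gibberish(text, threshold=0.85):
    """Flag text whose printable fraction falls below threshold."""
    return printable_ratio(text) < threshold
```

Output that is flagged this way would point at documents where raw binary
content is leaking through the conversion. Text in other languages would also
be flagged, so the check only makes sense under the ASCII assumption above.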

Also, do you know if there is a parser for MS Visio?

Any help would be appreciated.


Aditya Shah


This archive was generated by hypermail 2b28 : Mon Dec 18 2000 - 18:39:48 PST