Re: [htdig] Question about parsing word, pdf, ppt etc.


Subject: Re: [htdig] Question about parsing word, pdf, ppt etc.
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Tue Dec 19 2000 - 03:56:05 PST


Try executing the parsers at the command line to see what happens.

I don't know for certain, but it seems quite possible that the current
version of ppt2html cannot cope with the PowerPoint 2000 format. If that is
the case, you could try contacting the author directly. I have also noticed
that ppt2html can require a lot of memory (several hundred megabytes) to
convert some .ppt files. Could you be running short of memory?
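
If you want to check for a hang without tying up a terminal, something like
the following could help. It is only a rough sketch in Python: the function
name run_converter and the script name try_converter.py are made up for
illustration, and you would pass whatever converter command line you
normally use (I am not assuming any particular options for ppt2html). It
runs the command with a time limit and reports whether it finished, failed,
timed out, or could not be found.

```python
import subprocess
import sys


def run_converter(cmd, timeout_s=60):
    """Run a converter command with a time limit.

    cmd is a list such as ["/path/to/converter", "sample.ppt"]; both the
    path and the file name are placeholders supplied by the caller.
    Returns (status, output), where status is one of
    "ok", "failed", "timeout" or "missing".
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The converter was still running at the deadline and has been killed.
        return ("timeout", "")
    except FileNotFoundError:
        # The executable named in cmd[0] does not exist.
        return ("missing", "")
    status = "ok" if proc.returncode == 0 else "failed"
    # latin-1 maps every byte to a character, so odd output cannot raise here.
    return (status, proc.stdout.decode("latin-1"))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the command to run.
    status, output = run_converter(sys.argv[1:], timeout_s=120)
    print("status:", status)
    print("bytes of output:", len(output))
```

You would call it as, say, "python3 try_converter.py /path/to/ppt2html
sample.ppt". A "timeout" result on a PowerPoint file that Excel files do not
trigger would point at the converter rather than at htdig itself.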

Are you using catdoc or wp2html to convert Word files? wp2html extracts the
'subject' from the document summary and puts it in the header, which might
be the problem. catdoc does often include gibberish in its output, and you
may find that removing the -b option from the call to catdoc improves matters.

doc2html.pl uses pdfinfo to extract the title of the PDF file, and I have
seen PDF documents where the title is ' ' for some reason. You might
need to modify doc2html.pl to suppress such titles.
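
To illustrate the kind of check I mean, here is a minimal sketch. It is
written in Python purely for illustration; doc2html.pl is a Perl script, so
the same logic would need to be ported, and the function name clean_title
and its fallback parameter are my own inventions, not anything in
doc2html.pl. It treats a title with no letters or digits as empty, which
also covers titles made up only of spaces or no-break spaces.

```python
def clean_title(raw_title, fallback=""):
    """Return a tidied title, or fallback when nothing usable is left.

    raw_title is whatever string the title extraction produced (or None).
    """
    if raw_title is None:
        return fallback
    # split() with no arguments discards all Unicode whitespace,
    # including the no-break space (U+00A0), and collapses runs of it.
    collapsed = " ".join(raw_title.split())
    # Assumption: a title without a single letter or digit is not worth
    # showing, so punctuation-only titles are suppressed as well.
    if not any(ch.isalnum() for ch in collapsed):
        return fallback
    return collapsed
```

Passing the file name as the fallback would give the search results
something sensible to display when the PDF title is blank.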

----- Original Message -----
From: "Aditya Shah" <adshah@coreon.net>
To: <htdig@htdig.org>
Cc: <rwani@coreon.net>
Sent: Tuesday, December 19, 2000 2:28 AM
Subject: [htdig] Question about parsing word, pdf, ppt etc.

> Hello,
>
> We are evaluating the use of htDig for an intranet site. Our users publish a
> lot of Word, Excel, Powerpoint and PDF Documents that we want to be able to
> search through.
>
> We have been able to get all the external parsers required. We have run into
> the following issues:
>
> 1) Unable to parse PowerPoint documents. The documents are MS-PowerPoint
> 2000 documents. We got the ppt2html parser from www.xlHtml.org. The
> statements in htdig.conf are something like this:
>
> application/msexcel->text/html /app/doc2html/doc2html.pl \
> application/mspowerpoint->text/html /app/doc2html/doc2html.pl \
> application/vnd.ms-excel->text/html /app/doc2html/doc2html.pl \
> application/vnd.ms-powerpoint->text/html /app/doc2html/doc2html.pl
>
> Excel works great, but for powerpoint, when I run the 'rundig' program, it
> just kind of hangs there.
>
> 2) Getting gibberish in the headers for some word and pdf documents. For
> example, for a word document:
>
> In doc 2 html ; ; ; ; ; ; ; ; Fax Fax Please Recycle Comments: `"?
> gP?]...u-OwP?+`?0|?( o?UY{O?rs-| ?]* ! ^mB?t
> 5?z+Hc-#*g"C?,m?Pss (_~$+-V S??_+yw?<?-? ?\...Y ...
>
> when the search results are returned. This does not happen for all word
> documents, only for some of them.
>
> And for a PDF document, we always get the ' ' character before any file
> name in the search results section.
>
> Also, do you know if there is a parser for MS-Visio?
>
> Any help would be appreciated.
>
> Thanks.
>
> Aditya Shah
>
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig-unsubscribe@htdig.org
> You will receive a message to confirm this.
> List archives: <http://www.htdig.org/mail/menu.html>
> FAQ: <http://www.htdig.org/FAQ.html>
>
>




This archive was generated by hypermail 2b28 : Tue Dec 19 2000 - 04:06:39 PST