Re: [htdig] Index word doc meta tags (toc)


Subject: Re: [htdig] Index word doc meta tags (toc)
From: David Robley (huntsman@www.nisu.flinders.edu.au)
Date: Tue Apr 18 2000 - 22:52:27 PDT


On 18 Apr, Geoff Hutchison wrote:
> At 4:11 PM +0000 4/18/00, Steve Wambolt3 wrote:
>>I have just installed htdig -- plus the parse.pl script to index pdf and
>>word documents .. so far it looks great ...
>>[snip]
>>Example - I have a 50-page Word doc - it has a 3-page table of contents (when
>>you reveal code in the Word doc you get this) {TOC \o "1-2"} - What I
>>would like to be able to do is tell htdig to index ONLY the table of
>>contents - I guess by passing it the metatag above ????
>
> You don't mention what program you're using to convert the Word
> documents, so I'll assume catdoc. I would use this program to convert
> one of your documents and take a look to see if there's an easy way
> to separate the TOC section from the rest of the document. Then you'd
> want to hack the Perl script (I'm guessing from your comments that
> you're using parse_doc.pl -- conv_doc.pl or the new doc2html scripts
> should work as well) to ignore everything but this.
>
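
As a rough illustration of the "ignore everything but the TOC" idea above, here is a minimal Python sketch of the filtering step. It assumes the converter (catdoc or similar) has already produced plain text, that the TOC starts at a heading line reading "Table of Contents" or "Contents", and that each entry ends in a page number. The function name extract_toc and both patterns are hypothetical; check them against real converter output before relying on them.

```python
import re

# Assumption: the converted text has a heading line such as
# "Table of Contents", followed by entries that end in a page number.
TOC_HEADING = re.compile(r"^\s*(table of contents|contents)\s*$", re.IGNORECASE)
TOC_ENTRY = re.compile(r"^\s*\S.*?[\s.]+\d+\s*$")


def extract_toc(text):
    """Return only the table-of-contents lines from converted document text."""
    kept = []
    in_toc = False
    for line in text.splitlines():
        if not in_toc:
            # Skip everything before the TOC heading.
            if TOC_HEADING.match(line):
                in_toc = True
                kept.append(line.strip())
            continue
        if not line.strip():
            continue  # blank lines inside the TOC are skipped
        if TOC_ENTRY.match(line):
            kept.append(line.strip())
        else:
            break  # first non-entry line is taken as the start of the body
    return "\n".join(kept)


if __name__ == "__main__":
    sample = (
        "My Report\n\nTable of Contents\n\n"
        "Introduction ..... 1\nMethods\t4\n\n"
        "Introduction\nThis is the body text.\n"
    )
    print(extract_toc(sample))
```

The heuristic is deliberately crude: a body line that happens to end in a number right after the TOC would be kept as well, so the patterns will likely need tuning per document set.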

You may also wish to consider a package called rtf2html
(http://www.sunpack.com/RTF/).

Then you can turn your M$ Turd documents into useful HTML :-) Yes, it
costs, but we have found it well worth the few bucks. Much of our
content is produced as massive (well, by HTML page sizes) Word docs,
which we use rtf2html to translate, break up into useful sizes and link
the chunks with next/prev links. Then feed it to htdig!
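
For anyone wanting to do the chunk-and-link step by hand, a minimal Python sketch follows. It is not how rtf2html itself works; it only assumes you already have the document as a list of HTML-safe paragraph strings. The names split_paragraphs and build_pages, the "part" file prefix, and the size limit are all hypothetical.

```python
def split_paragraphs(paragraphs, max_chars):
    """Group paragraphs into chunks of roughly max_chars characters each."""
    chunks, current, size = [], [], 0
    for para in paragraphs:
        # Start a new chunk once adding this paragraph would exceed the limit.
        if current and size + len(para) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append(current)
    return chunks


def build_pages(chunks, basename="part"):
    """Return {filename: html} with prev/next links between consecutive pages."""
    names = ["%s%d.html" % (basename, i + 1) for i in range(len(chunks))]
    pages = {}
    for i, chunk in enumerate(chunks):
        nav = []
        if i > 0:
            nav.append('<a href="%s">prev</a>' % names[i - 1])
        if i < len(chunks) - 1:
            nav.append('<a href="%s">next</a>' % names[i + 1])
        # Assumption: paragraphs are already HTML-escaped.
        body = "\n".join("<p>%s</p>" % p for p in chunk)
        pages[names[i]] = "<html><body>\n%s\n<p>%s</p>\n</body></html>" % (
            body,
            " | ".join(nav),
        )
    return pages
```

Writing each entry of the returned dictionary to disk gives a set of small linked pages that htdig can crawl starting from the first one.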

Cheers

-- 
David Robley                        | WEBMASTER & Mail List Admin
RESEARCH CENTRE FOR INJURY STUDIES  | http://www.nisu.flinders.edu.au/
AusEinet                            | http://auseinet.flinders.edu.au/
            Flinders University, ADELAIDE, SOUTH AUSTRALIA




This archive was generated by hypermail 2b28 : Tue Apr 18 2000 - 20:39:14 PDT