Re: [htdig] File formats supported


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 25 Feb 1999 16:00:36 -0600 (CST)


According to Ismael Olea:
> Gilles Detillieux wrote:
> > Seriously, the main page at that URL does mention it. If you scroll down
> > to the Features section, it says:
> >
> > - Searching of HTML and text files
> > Both HTML documents and plain text files can be searched.
> > Searching of other file types will be supported in future versions.
>
> Can htdig handle SGML files too? And can it manage meta tags in HTML
> files?

No, I don't think it can handle SGML. I'm not familiar with SGML, but my
understanding is that many of its tags are quite different from HTML's.
Also, the HTTP server would likely assign a different content-type to
SGML documents, so htdig wouldn't even attempt to parse them.

Meta tags in HTML are supported by htdig.

> > That's not quite the whole story, though. There is some support for
> > PDF documents right now, if you have acroread (Adobe Acrobat Reader) on
> > your system. Also, with external parsers, you can index a whole lot more.
>
> Must these external parsers be htdig aware, or can they be Unix-like?
> Where can I find them?
>
> > The parse_doc.pl script in ht://Dig 3.1.1's contrib directory can handle
>
> Looks very interesting.

External parsers must definitely be htdig aware. Their output must adhere
to the format specified in the documentation. See

        http://www.htdig.org/attrs.html#external_parsers

for details. The parse_doc.pl script, along with its earlier versions
written as Perl and shell scripts, is the only publicly available
external parser that I know of. Someone on the list can correct me if
I'm wrong. parse_doc.pl is also a good starting point if you want to set
up an interface between htdig and any number of more Unix-like document
parsers. Any filter that can extract plain text from a document can
easily be plugged into this script, which then handles the generation of
records for htdig.
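
To give a rough idea of what "generating records" involves, here is a
minimal sketch in Python. It is not taken from parse_doc.pl, and it rests
on assumptions that should be checked against the documentation at the
URL above: that htdig calls the parser with the input file as its first
argument (followed by content-type, URL and config file), and that the
output is one tab-separated record per line, with "t" for the title, "w"
for a word plus a location from 0 to 1000 and a heading level, and "h"
for the excerpt text. The names make_records and main are made up for
this illustration.

```python
import re
import sys

# Assumption: a "word" is a run of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def make_records(text, title=""):
    """Turn plain text into tab-separated records, one string per record."""
    records = []
    if title:
        records.append("t\t" + title)
    words = WORD_RE.findall(text)
    total = len(words)
    for index, word in enumerate(words):
        # Normalized position: 0 is the top of the document (assumed 0-1000).
        location = index * 1000 // total
        # Trailing 0 is the assumed heading level for ordinary body text.
        records.append("w\t%s\t%d\t0" % (word.lower(), location))
    # Excerpt record: whitespace collapsed, truncated to 80 characters.
    excerpt = " ".join(text.split())[:80]
    if excerpt:
        records.append("h\t" + excerpt)
    return records


def main(argv):
    # Assumed calling convention: infile content-type URL config-file
    if len(argv) < 2:
        sys.stderr.write("usage: parser infile [content-type URL config]\n")
        return 1
    with open(argv[1], errors="replace") as handle:
        text = handle.read()
    records = make_records(text)
    if records:
        print("\n".join(records))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

In a real setup, the text passed to make_records would come from
whatever filter extracts plain text from the document type in question,
which is the part parse_doc.pl lets you plug in.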

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


