Re: [htdig] File formats supported

Gilles Detillieux
Thu, 25 Feb 1999 16:00:36 -0600 (CST)

According to Ismael Olea:
> Gilles Detillieux wrote:
> > Seriously, the main page at that URL does mention it. If you scroll down
> > to the Features section, it says:
> >
> > - Searching of HTML and text files
> > Both HTML documents and plain text files can be searched.
> > Searching of other file types will be supported in future versions.
> Can htdig handle SGML files too? And can it manage meta tags in HTML
> files?

No, I don't think it can handle SGML. I'm not familiar with SGML, but my
understanding is that a lot of its tags are quite different from HTML's.
Also, the HTTP server would likely assign a different content-type to
SGML documents, so htdig won't even attempt to parse them.

Meta tags in HTML are supported by htdig.

> > That's not quite the whole story, though. There is some support for
> > PDF documents right now, if you have acroread (Adobe Acrobat Reader) on
> > your system. Also, with external parsers, you can index a whole lot more.
> Must these external parsers be htdig aware, or can they be Unix-like? Where
> can I find them?
> > The script in ht://Dig 3.1.1's contrib directory can handle
> Looks very interesting.

External parsers must definitely be htdig aware. Their output must adhere
to the format specified in the documentation. See

for details. The script, and its earlier versions as Perl
and shell scripts, is the only external parser around that's publicly
available, as far as I know. Someone on the list can correct me if I'm
wrong. It is also a good starting point if you want to set
up an interface between htdig and any number of more Unix-like document
parsers. Any filter that can extract plain text from a document can
easily be plugged into this script, and it handles the generation of
records for htdig.
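
As a rough illustration of the point above, here is a minimal sketch in
Python of the record-generating half of such a wrapper: it takes plain text
that some filter has already extracted and turns it into record lines. The
function name make_records and its parameters are hypothetical, and the
record layout (tab-separated lines beginning with t, w or h, with word
positions scaled to 0-1000) is an assumption based on my reading of the
external parser documentation, so check it against the documentation before
relying on it.

```python
import re


def make_records(text, title="", excerpt_len=80):
    """Turn extracted plain text into tab-separated record lines.

    ASSUMPTION: the record layout below (t = title, w = word with a
    relative location and heading level, h = excerpt text) reflects one
    reading of the htdig external parser documentation; verify it there.
    """
    records = []
    if title:
        records.append("t\t" + title)

    # Scale each word's character offset to a relative position 0..1000.
    total = max(len(text), 1)
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        location = match.start() * 1000 // total
        # Heading level 0 is used here to mean "ordinary body text".
        records.append("w\t%s\t%d\t0" % (match.group(0).lower(), location))

    # Collapse whitespace and keep a short excerpt of the document.
    excerpt = " ".join(text.split())[:excerpt_len]
    if excerpt:
        records.append("h\t" + excerpt)
    return records
```

A wrapper script would run its text-extraction filter on the input file,
pass the result to make_records, and print each returned line to standard
output for htdig to read.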

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930