[htdig] external_parsers

Patrick Dugal (dugal@lynx.cisti.nrc.ca)
Wed, 6 Oct 1999 10:23:54 -0400


I have posted about this before, but I never received any reply. If
something is not clear, please feel free to ask questions. We're currently
using ht://Dig to index over 120,000 documents on more than 50 NRC servers.
It works great for us, except for a few things.

According to the documentation on external_parsers, there is no field for
the author of the contents, or extra fields for custom data, although there
are other pertinent fields such as title, words, etc. I'd like to customize
the search results from htsearch for PDF files to show the names of the
author(s) of each PDF, a link to an HTML abstract, and perhaps more data
related to the PDF.

The names of the author(s) of a PDF file can be obtained (through an HTTP
request) by parsing a corresponding SGML file located on a separate machine
from the one that runs htdig. The URLs of the SGML file and the HTML
abstract are based on the URL of the PDF document. It is possible to
customize the results with a CGI wrapper that calls htsearch with the
proper query string and rewrites the results (adding or removing whatever
text I want), but this takes too long (more than 20 seconds). So I'd like
to insert the data into the database at dig time instead, which should
speed up searching dramatically.
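
To make the dig-time lookup concrete, here is a minimal sketch in Python. It
is only an illustration: the URL mapping (swapping the .pdf extension for
.sgml or .html), the <author> element name, and the function names are all
assumptions, since the post does not give the real URL scheme or the SGML
DTD.

```python
import re
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def companion_urls(pdf_url, sgml_ext=".sgml", html_ext=".html"):
    """Derive the SGML and HTML abstract URLs from a PDF URL.

    Assumption: the companions sit next to the PDF and differ only in
    their extension. The real mapping may be different.
    """
    parts = urlsplit(pdf_url)
    if not parts.path.lower().endswith(".pdf"):
        raise ValueError("not a PDF URL: " + pdf_url)
    stem = parts.path[:-4]  # drop the ".pdf" suffix
    sgml = urlunsplit((parts.scheme, parts.netloc, stem + sgml_ext, "", ""))
    html = urlunsplit((parts.scheme, parts.netloc, stem + html_ext, "", ""))
    return sgml, html


# Assumption: authors are marked up as <author>...</author> in the SGML.
AUTHOR_RE = re.compile(r"<author>(.*?)</author>", re.IGNORECASE | re.DOTALL)


def extract_authors(sgml_text):
    """Return author names found in the SGML, with whitespace collapsed."""
    return [" ".join(name.split()) for name in AUTHOR_RE.findall(sgml_text)]


def fetch_text(url, timeout=30):
    """Fetch a URL once, at dig time, and decode it (Latin-1 assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("latin-1")
```

Run once per document during the dig, the slow HTTP request and SGML parse
are paid at indexing time and not on every search.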

So my main question is: what would be the best way to add this customization
of the results? If it's not easy to add fields to the database, what would
be the best kludge so that HTML is preserved when data is stored in and
retrieved from an existing field? In other words, if adding a field is too
complicated, which existing field would be best to use, keeping in mind
that some HTML will be added to the field (for look and feel) in order to
customize the results? The reason I want to play with the fields is that
the SGML files and ht://Dig are on two separate machines. I'd rather the
digging take longer than make a Web user wait 20 or more seconds on every
query while HTTP requests go to the other machine for the SGML files and
those files are parsed.
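
One possible form of the kludge, sketched in Python: have the external
parser fold the author names and abstract link into the text it emits for
an existing field. This assumes the tab-separated record format described
in the external_parsers documentation, where a line starting with "t" sets
the title and a line starting with "h" supplies the excerpt text. The
function name and the "Authors:" / "Abstract:" labels are hypothetical, and
the sketch emits plain text only; whether htsearch would pass HTML markup
in such a field through unchanged is not something it settles.

```python
def parser_records(title, authors, abstract_url, body_text):
    """Build external-parser output lines carrying the extra metadata.

    Assumption: records are tab-separated, "t" for the title and "h" for
    the excerpt text, as in the external_parsers documentation.
    """
    def clean(text):
        # Tabs and newlines would break the one-record-per-line format.
        return " ".join(text.split())

    byline = "Authors: " + ", ".join(authors) + " | Abstract: " + abstract_url
    return [
        "t\t" + clean(title),
        # Put the byline first so it appears at the start of the excerpt.
        "h\t" + clean(byline + " " + body_text),
    ]


if __name__ == "__main__":
    for line in parser_records(
        "Example paper",
        ["Jane Doe", "John Roe"],
        "http://example.org/docs/paper1.html",
        "First words of the PDF text.",
    ):
        print(line)
```

The trade-off is that the extra data lives inside the excerpt and cannot be
placed separately in a result template, which is why a real extra field
would still be preferable.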

Any advice would be appreciated. Thanks in advance!


