htdig: how i made ht://dig the tool of my dreams

Matt Braithwaite
09 Feb 1998 14:18:45 -0800



so, we have this database of tickets that we use for tracking things.
the tickets are displayed on the web by CGIs. i wanted to index the
tickets using htdig, and it turned out to require just a few minor
modifications, which i wanted to pass on to you. the gist of the
changes is that the ticket database can save htdig time, because it
has a better idea than htdig of what has been changed recently.

so, two categories, bugs and suggestions. first the suggestions.

1) i wrote a little CGI to contain links to recently changed tickets;
so i just rerun htdig on the output of this cgi every so often. thus,
the step where htdig checks all documents it knows about for
modifications is unnecessary. it's also highly time-consuming,
because the CGI that retrieves tickets is slow, and depends on a
database backend. commenting out these lines

    // List *list = docs.URLs();
    // retriever.Initial(*list);
    // delete list;

does the trick. this could easily and usefully be made into a
command-line switch.
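
a minimal sketch of how such a switch might look, in standalone C++
rather than against htdig's real source. the flag name `--skip-known',
the DigOptions struct and both helper functions are made up for
illustration; only the idea (skip the docs.URLs() / retriever.Initial()
step when asked) comes from the change above.

```cpp
#include <string>
#include <vector>

// hypothetical options holder; not a type from htdig itself
struct DigOptions {
    bool recheck_known = true;   // default keeps the existing behaviour
};

// "--skip-known" is an invented flag name, not an existing htdig option
DigOptions parse_args(const std::vector<std::string>& args) {
    DigOptions opts;
    for (const auto& a : args)
        if (a == "--skip-known")
            opts.recheck_known = false;
    return opts;
}

// stand-in for the step that seeds the retriever: always queue the start
// URLs, and queue the already-known documents only when rechecking is on
std::vector<std::string> initial_urls(const DigOptions& opts,
                                      const std::vector<std::string>& known,
                                      const std::vector<std::string>& start) {
    std::vector<std::string> queue = start;
    if (opts.recheck_known)
        queue.insert(queue.end(), known.begin(), known.end());
    return queue;
}
```

with the flag given, only the start URL (here, the "recently changed"
CGI) is queued, so the slow ticket CGI is hit only for tickets that
page links to.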

2) unfortunately, the `If-Modified-Since:' header is useless with
CGIs, at least under apache. apache only bothers with the header if
serving regular files. so, my ticket-retrieval CGI inserts a
`Last-modified:' header which is the date the ticket was last changed in
the database. all that's needed is to make htdig not depend on
getting `document not modified' status from the web server, by adding
this:

    if ((modtime > 0) && (modtime <= date)) {
        return Document_not_changed;
    }

this is another thing that could be made into a command-line switch;
though i really see no harm in turning it on all the time.
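
to make the three cases explicit, here is the same test as a
standalone function. this is only a sketch: the DocStatus enum and
check_modified() are stand-ins for illustration, and it assumes
`modtime' is the parsed `Last-modified:' value (0 when the header is
absent) and `date' is the time of the copy already in the index.

```cpp
#include <ctime>

// stand-in status codes for this sketch only
enum DocStatus { Document_ok, Document_not_changed };

// modtime: time from the Last-modified header, 0 if the server sent none
// date:    time recorded for the copy that is already indexed
DocStatus check_modified(std::time_t modtime, std::time_t date) {
    // a usable header that is no newer than the indexed copy: skip it
    if ((modtime > 0) && (modtime <= date))
        return Document_not_changed;
    // newer than the index, or no header at all: parse the document
    return Document_ok;
}
```

a missing header falls through to a full parse, which is why leaving
the check on all the time should be safe.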

now, the bugs.

1) ChangeLog in 3.0.8 shows you know about the mystrncasecmp bug, so
forget that.

2) htsearch doesn't deal well with more than one `keywords'. it seems
to ignore that multiple keywords are separated with \001, as well
as the possibility of some of the keywords being blank (like,
`\001foo\001\001bar' or something). my fix to this is gross; you can
do better. :-)
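
one possible shape for a cleaner fix, as a standalone sketch rather
than a patch to htsearch: split the raw value on \001 and drop empty
pieces. split_keywords() is a made-up name, and how the result would
be fed back into the search is left open.

```cpp
#include <string>
#include <vector>

// split a \001-separated value into keywords, skipping blank entries
std::vector<std::string> split_keywords(const std::string& raw) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : raw) {
        if (c == '\001') {
            if (!cur.empty())
                out.push_back(cur);   // end of a non-blank keyword
            cur.clear();
        } else {
            cur += c;
        }
    }
    if (!cur.empty())
        out.push_back(cur);           // last keyword has no trailing \001
    return out;
}
```

with this, `\001foo\001\001bar' yields the two keywords foo and bar.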

3) if you use the -u option to htdig, the password can still be read
by using `ps'. my half-assed fix was

               credentials = strdup(optarg);
               *optarg = 0;

but there should really be a way to take the password from stdin or
from a file (the environment can be read by `ps' as well).
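
a sketch of both ideas, with made-up helper names. read_credentials()
takes the first line of any stream, so the caller can hand it stdin
or an opened file; which option would select that is left open here.
scrub() overwrites the whole argument rather than just its first
byte, since on systems where `ps' reads the raw argument area the
remaining bytes may still be visible after `*optarg = 0'. even so,
the password is exposed for a moment before the scrub runs.

```cpp
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// read one line (e.g. "user:password") from stdin or an opened file
std::string read_credentials(std::istream& in) {
    std::string line;
    std::getline(in, line);
    if (!line.empty() && line.back() == '\r')
        line.pop_back();              // tolerate CRLF line endings
    return line;
}

// zero every byte of a command-line argument after it has been copied
void scrub(char* arg) {
    std::memset(arg, 0, std::strlen(arg));
}
```

in use, the existing strdup(optarg) would be followed by scrub(optarg),
and the stdin variant would call read_credentials(std::cin) instead of
looking at optarg at all.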

4) it appears that when you instruct htdig to use alternate work files
(-a), and subsequently move the .work files onto the regular file
names, subsequent runs with -a start over; i.e., they expect there to
be preexisting .work files. so if you're using -a you really need two
copies while running htdig---you must copy the regular files to .work,
run htdig, and then move the .work files over the regular files. i
understand this now, but it surprised me---perhaps the documentation
could clarify this. also, is it really necessary? while writing the
.work files, htdig could *read* the regularly-named files to find
things like the last-modified date of a document...right? i don't
really know what i'm talking about here though.

anyway, thanks for htdig. hacking on (up?) the code was a real
pleasure. :-)

--
Matthew Braithwaite
A-Link Network Services, Inc. 408.720.6161

Alors, ô ma beauté! dites à la vermine / Qui vous mangera de baisers,
Que j'ai gardé la forme et l'essence divine / De mes amours décomposés!

