[htdig] turning off re-indexing?

Laura Wingerd (laura@perforce.com)
Tue, 25 May 1999 09:06:09 -0700 (PDT)

There isn't any way to turn off automatic re-indexing in htdig, is there?

I have a database of some 6000 indexed pages. Our SCM system can tell
me in a few seconds which of those pages have been updated, whereas it
takes hours for htdig to query the web server to determine the same
thing. (Part of the problem is that many of our pages have no doc dates
-- they are generated by CGI apps. Another part of the problem is that
many of the undated docs are huge.)

I can generate an HTML doc on the fly that links to only the pages that
need reindexing. (Fifty or so on a typical day.) What I'd like to do is
run htdig using that doc as the start_url. However, htdig seems to
insist on reindexing *all* the pages. It would be awfully nice to be
able to turn that off.
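
For illustration, here is a rough sketch of how such a start doc might be
generated. It assumes the SCM system has already produced a plain list of
changed URLs; the function name build_start_page and the example URLs are
hypothetical, not anything htdig or our SCM provides.

```python
from html import escape


def build_start_page(changed_urls, title="Pages needing reindexing"):
    """Return an HTML page whose only content is links to changed_urls.

    changed_urls is assumed to come from the SCM system's list of
    updated pages; this function does not query anything itself.
    """
    # Escape each URL so characters such as '&' in CGI query strings
    # remain valid inside the href attribute.
    links = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in changed_urls
    )
    return (
        "<html><head><title>{}</title></head>\n"
        "<body>\n<ul>\n{}\n</ul>\n</body></html>\n"
    ).format(escape(title), links)


if __name__ == "__main__":
    # Hypothetical URLs, for demonstration only.
    print(build_start_page([
        "http://intranet.example/calls/1234.html",
        "http://intranet.example/cgi-bin/bugs?id=42&view=full",
    ]))
```

The output would be saved somewhere the web server can serve it, and that
location used as the start_url.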

As a workaround, I do an htdig using the generated start_url doc and
create a new database of just the changed docs. Then I merge that into
the existing database. This works pretty well for us because the
majority of our indexed docs are "grow-only" docs: customer call
history, bug tracking history, and internal mail archives.
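
The workaround can be sketched roughly as the command sequence below. This
is only an outline under assumptions: the two config file names are
hypothetical, and it assumes an htdig release whose htmerge supports the -m
option for merging one database into another. The exact steps may differ
between versions. The function only builds the command lines; it does not
run them.

```python
def incremental_commands(changed_conf, main_conf):
    """Build the command lines for the dig-then-merge workaround.

    changed_conf: hypothetical config whose start_url is the generated
                  doc and whose database files are separate from the
                  main ones.
    main_conf:    hypothetical config for the existing full database.
    """
    return [
        # Dig only the changed docs into a fresh, separate database.
        ["htdig", "-i", "-c", changed_conf],
        # Build the index for that small database.
        ["htmerge", "-c", changed_conf],
        # Merge the small database into the existing one
        # (assumes htmerge accepts -m, as noted above).
        ["htmerge", "-c", main_conf, "-m", changed_conf],
    ]


if __name__ == "__main__":
    for cmd in incremental_commands("changed.conf", "main.conf"):
        print(" ".join(cmd))
```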

However, we do have a smattering of conventional web pages on our
intranet, and as those get changed, our db accumulates out-of-date
references to obsolete versions. The only way to get rid of those
references is to reindex the entire db, which takes hours. Now, if I
could just suppress the reindexing of all docs and have htdig reindex
only the docs I tell it to, I'd have all my ducks in a row.


Laura Wingerd laura@perforce.com Voice: 1-510-864-7400
Perforce Software, Inc. www.perforce.com Fax: 1-510-864-5340
