Re: [htdig] htdig database related questions

Subject: Re: [htdig] htdig database related questions
From: Geoff Hutchison
Date: Wed Dec 06 2000 - 05:57:54 PST

At 9:35 AM +0100 12/6/00, Haeberlen@RUS.Uni-Stuttgart.DE wrote:
>Is there anything wrong with our db files? htsearch seems to be able
>to use them, though. Am I missing something?

No, but I don't think you want to use the db_dump programs to deal
with them. In particular, ht://Dig "serializes" the documents in the
document DB and can compress the excerpts, so large parts will come
out in binary.

>Why do I want to "edit" the db files at all? The reason is that we have
>a large database with quite a number of things we'd like to exclude
>from the search results. The obvious solution would be to exclude them
>from the dig in the first place. But I don't consider this possible
>because a) this would make the config quite bulky

You can always keep the patterns in a separate file and include it
from the config file by putting the path in backquotes, e.g.:
exclude_urls: `/path/to/patterns`

In the 3.2 code, you can do limited editing with the new htdump and
htload programs. On the other hand, if you just want to delete URLs,
it's much easier to use the new htpurge program.

>PS: How does htdig handle the case where a document is in the docs database
>but the corresponding URL is added to the exclude list? Will the document
>be deleted from the db on the next update run, or would I have to delete the
>db and run a "full index" again?

The exclude_urls pattern set is only used when considering whether to
index a new URL. So if a URL is already in the database, it will not
be removed on an update run. There is a similar but more serious
problem if an already-indexed document is later added to the
robots.txt file. In both cases, the code is upholding the "letter of
the law," but it's a bit hazy.
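
To make the behavior above concrete, here is a rough Python sketch. It
is not the ht://Dig code, just a toy model: the function names, the
example.com URLs, and the plain substring matching are all assumptions
made for illustration.

```python
def matches_exclude(url, exclude_patterns):
    """True if any pattern occurs in the URL (substring match is assumed here)."""
    return any(pattern in url for pattern in exclude_patterns)


def update_run(database, discovered, exclude_patterns):
    """Toy model of an update run: the exclude list gates only new URLs."""
    result = set(database)  # existing entries are kept without being re-checked
    for url in discovered:
        if url not in result and not matches_exclude(url, exclude_patterns):
            result.add(url)
    return result


def purge_candidates(database, exclude_patterns):
    """Already-stored URLs that match the exclude list and would need manual removal."""
    return sorted(u for u in database if matches_exclude(u, exclude_patterns))


# Hypothetical data: one stored URL matches a pattern added after it was indexed.
db = {"http://example.com/a.html", "http://example.com/private/b.html"}
found = ["http://example.com/private/c.html", "http://example.com/d.html"]
patterns = ["/private/"]

after = update_run(db, found, patterns)
leftovers = purge_candidates(after, patterns)
```

In this model the newly found /private/c.html is skipped, while the
previously stored /private/b.html survives the update and shows up in
leftovers, which is the kind of list you would then remove by hand.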

-Geoff Hutchison
Williams Students Online


This archive was generated by hypermail 2b28 : Wed Dec 06 2000 - 06:14:13 PST