Re: [htdig] Identifying non-indexed URLs

Subject: Re: [htdig] Identifying non-indexed URLs
From: Gilles Detillieux
Date: Tue Mar 14 2000 - 11:55:43 PST

According to Bigler, Tyson MT SSI:
> I probably didn't explain myself very well. :-D I need to identify the
> reason for the difference between the number of documents seen and the
> number of documents indexed (e.g. the number of documents indexed is always
> lower than the number of documents "seen"). I don't recall seeing "Not
> Parsable" in the output -- would I only see that in -vv mode? I've used all
> of the 3.1.x versions (currently using 3.1.5).

No, you wouldn't see "not Parsable" in the output of htdig 3.1.5, as
that message only appears in versions 3.2.0b1 and up. In 3.1.5, the
message would be "not HTML" for any document it cannot parse, as I said
in my last e-mail. You'd get that message with one or more -v options.

With two -v options (-vv), htdig will tell you about level 1 or level
2 rejections of URLs, and with three verbose options it will further
explain the reason for level 1 rejection, of which there may be several
(level 2 is because of limit_normalized). The higher the verbose level,
though, the more output you have to wade through to get at these messages.

That should tell you all you need to know about why htdig is rejecting
URLs. You may also need to look at why htmerge would reject some.
Reasons for this are less clearly explained in error messages. The most
common message from htmerge (on a fresh database at least) is "Deleted, no
excerpt", which usually means that the document contains a noindex
directive, that it is disallowed by robots.txt, or that server_max_docs
was reached.
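
If the verbose output is too long to read by eye, a small script can tally
the messages for you. The following is only a rough sketch in Python, not
part of ht://Dig. It assumes you have saved the htdig and htmerge output to
log files, and it simply counts lines containing the phrases mentioned above.
"not HTML" and "Deleted, no excerpt" are the messages quoted in this thread;
the "level 1" and "level 2" strings are my guess at how the rejection lines
are worded, so adjust them to whatever your own output shows.

```python
import sys
from collections import Counter

# Substrings to look for. "not HTML" and "Deleted, no excerpt" are quoted in
# the thread; "level 1" / "level 2" are assumptions about the wording of the
# rejection lines and may need adjusting to match the real output.
MARKERS = ["not HTML", "level 1", "level 2", "Deleted, no excerpt"]


def tally(lines, markers=MARKERS):
    """Count how many lines contain each marker (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for marker in markers:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts


if __name__ == "__main__":
    # Each command-line argument is a saved log file (names are up to you).
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for marker, n in tally(fh).items():
                print(f"{path}: {n:6d}  {marker}")
```

You could capture the logs with something like
"htdig -vvv -c your.conf > dig.log 2>&1" and the equivalent for htmerge
(the file names here are hypothetical), then pass both logs to the script.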

Gilles R. Detillieux              E-mail: <>
Spinal Cord Research Centre       WWW:
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

