Re: htdig: Pages get indexed, but no results: BUG?

Rodger Zeisler
Thu, 17 Dec 1998 14:44:23 -0600

htdig has a case_sensitive option that would make PRODUCTS.HTM and
products.htm appear the same. Since you have 3-character extensions (.htm),
I am assuming that you are on NT, not Unix (.html). NT (and MS Windows) is
case insensitive.

Rodger Zeisler
Everest Software Corp. - - Helping You
Manage Software
InfoServer LLC - - The Journal For Strategic
Outsourcing Information
Work 972.980.0013 x738
Home 972.390.0206

----- Original Message -----
From: Andriu Isenring Ritsch <>
To: Gilles Detillieux <>
Cc: <>
Sent: Thursday, December 17, 1998 10:57 AM
Subject: Re: htdig: Pages get indexed, but no results: BUG?

>Gilles Detillieux wrote:
>> According to Andriu Isenring Ritsch:
>> > I've noticed that the links to the pages that don't seem to get indexed
>> > all start on one page called products.htm
>> >
>> > Now there is the problem that the page that links to products.htm has
>> > two links, one to products.htm and one to PRODUCTS.HTM.
>> >
>> > Because it's a Unix server and the page name is really products.htm,
>> > PRODUCTS.HTM gives a page not found error.
>> >
>> > Is it possible that htdig removes all pages indexed starting from
>> > products.htm, because PRODUCTS.HTM was not found?
>> >
>> > It seems to me like that...
>> >
>> > What workaround is there? Unfortunately most of the site is linked from
>> > products.htm
>> Is fixing the defective link not an option? If not, how about making
>> sure htdig sees the good one before the bad one, somehow?
>>
>> As it stands now, htdig keeps track of visited URLs by mapping them to
>> lower-case. This is valid for case-insensitive servers, but can be a
>> problem with case-sensitive ones - when they're not set up properly!
>> If you're careful to set up your links properly, and you don't use the
>> same name twice (one lower- and one upper-case) for different documents,
>> it shouldn't be a problem.
>>
>> The only other "fix" would be to edit htdig/, and find all
>> instances where the "visited" object is used. These are preceded by
>> a url.lowercase() or temp.lowercase(), to map the URL to lower-case.
>> You'd need to remove these, or replace them with calls to a function
>> that would only map the first part of the URL (
>> to lower-case, and leave the rest of the path as mixed case. Removing
>> the lowercase() calls altogether would mean you'd have to be consistent
>> in the case used in the hostname part of the URLs - probably not a safe
>> assumption given the fact that your site isn't even consistent in the
>> case expected for the document names. Mapping the first part of the
>> URL, but leaving the path as mixed case would solve your problem, but
>> could pose a problem if you index any case-insensitive servers.
>>
>> So, to answer the question you pose in the subject line, it's not a bug,
>> it's a feature! :-)
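
The two ways of keying the "visited" list described above can be compared
in a small sketch. This is not htdig's code (htdig is written in C++); it
is a rough Python illustration, and the function names and the
www.example.com host are made up for the example. It assumes the "first
part of the URL" means the scheme and host.

```python
from urllib.parse import urlsplit, urlunsplit


def whole_lower_key(url: str) -> str:
    """Key a URL by lower-casing all of it (the behaviour described above)."""
    return url.lower()


def host_lower_key(url: str) -> str:
    """Key a URL by lower-casing only the scheme and host (hypothetical fix).

    The path, query and fragment keep their original case. Note that
    lower-casing the whole netloc would also lower-case any user:password
    part, which is acceptable for a sketch but not for real credentials.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path, parts.query, parts.fragment)
    )


good = "http://www.example.com/products.htm"
bad = "http://www.example.com/PRODUCTS.HTM"

# With whole-URL lower-casing, the good and the bad link share one key,
# so whatever happens to one of them is recorded for both.
same_key = whole_lower_key(good) == whole_lower_key(bad)

# With host-only lower-casing, the two links are tracked separately.
separate_keys = host_lower_key(good) != host_lower_key(bad)
```

With the second function the hostname can still be written in any case,
but two paths that differ only in case become different documents, which
is the trade-off for case-insensitive servers mentioned above.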
>I just wondered: when PRODUCTS.HTM returns 404 and products.htm returns OK
>and has a lot of links on it to other pages etc., why does htdig assume that
>the link is not OK and delete all pages that have been retrieved starting
>from the products.htm page? I mean, htdig got the pages and deletes them
>again - they definitely must exist, so why delete them?
>
>One could also do it the other way around: htdig could know that one link
>was not OK, but since it was able to follow the page at some point, there
>must be a page with that name, so the links from that page must also be
>valid etc. (it could assume that there was an uppercase/lowercase problem).
>
>What do you think about that?
>
>(I know, fixing the link should be easier, but still...)
>To unsubscribe from the htdig mailing list, send a message to
> containing the single word "unsubscribe" in
>the body of the message.


This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:29:53 PST