Re: htdig: Pages get indexed, but no results: BUG?


Rodger Zeisler (rzeisler@eversoft.com)
Thu, 17 Dec 1998 14:44:23 -0600


htdig has a case_sensitive option that would make PRODUCT.HTM and
product.htm appear the same. Since you have 3-character extensions (.htm),
I am assuming that you are on NT, not Unix (.html). NT (and MS Windows) is
case-insensitive.

Rodger Zeisler
Everest Software Corp. - http://www.outsourcing-mgmt.com/ - Helping You
Manage Software
InfoServer LLC - http://www.infoserver.com - The Journal For Strategic
Outsourcing Information
rzeisler@eversoft.com
Work 972.980.0013 x738
Home 972.390.0206

----- Original Message -----
From: Andriu Isenring Ritsch <webmaster@netsolution.ch>
To: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Cc: <htdig@sdsu.edu>
Sent: Thursday, December 17, 1998 10:57 AM
Subject: Re: htdig: Pages get indexed, but no results: BUG?

>
>
>Gilles Detillieux wrote:
>
>> According to Andriu Isenring Ritsch:
>> > I've noticed that the links to the pages that don't seem to get
>> > indexed all start on one page called products.htm
>> >
>> > Now there is the problem that the page that links to products.htm has
>> > two links, one to products.htm and one to PRODUCTS.HTM.
>> >
>> > Because it's a Unix server and the page name is really products.htm,
>> > PRODUCTS.HTM gives a page not found error.
>> >
>> > Is it now possible that htdig removes all pages indexed starting from
>> > products.htm, because PRODUCTS.HTM was not found?
>> >
>> > It seems to me like that...
>> >
>> > What workaround is there? Unfortunately most of the site is linked from
>> > products.htm
>>
>> Is fixing the defective link not an option? If not, how about making
>> sure htdig sees the good one before the bad one, somehow?
>>
>> As it stands now, htdig keeps track of visited URLs by mapping them to
>> lower-case. This is valid for case-insensitive servers, but can be a
>> problem with case-sensitive ones - when they're not set up properly!
>> If you're careful to set up your links properly, and you don't use the
>> same name twice (one lower- and one upper-case) for different documents,
>> it shouldn't be a problem.
>>
>> The only other "fix" would be to edit htdig/Retriever.cc, and find all
>> instances where the "visited" object is used. These are preceded by
>> an url.lowercase() or temp.lowercase(), to map the URL to lower-case.
>> You'd need to remove these, or replace them with calls to a function
>> that would only map the first part of the URL (http://host.name.dom/)
>> to lower-case, and leave the rest of the path as mixed case. Removing
>> the lowercase() calls altogether would mean you'd have to be consistent
>> in the case used in the hostname part of the URLs - probably not a safe
>> assumption given the fact that your site isn't even consistent in the
>> case expected for the document names. Mapping the first part of the
>> URL, but leaving the path as mixed case would solve your problem, but
>> could pose a problem if you index any case-insensitive servers.
>>
>> So, to answer the question you pose in the subject line, it's not a bug,
>> it's a feature! :-)
>
>I just wondered: when PRODUCTS.HTM returns 404 and products.htm returns OK
>and has a lot of links on it to other pages etc., why does htdig assume
>that the link is not OK and delete all pages that have been retrieved
>starting from the products.htm page? I mean, htdig got the pages and
>deletes them again. They definitely must exist, so why delete them?
>One could also do it the other way around: htdig could know that one link
>was not OK, but since it was able to follow the page at some point, there
>must be a valid page with that name, so the links from that page must also
>be valid etc. (htdig could assume that there was an uppercase/lowercase
>problem).
>
>What do you think about that?
>
>(I know, fixing the link should be easier, but still...)
>
>Andriu
>
>----------------------------------------------------------------------
>To unsubscribe from the htdig mailing list, send a message to
>htdig-request@sdsu.edu containing the single word "unsubscribe" in
>the body of the message.
>




This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:29:53 PST