Re: htdig: Pages get indexed, but no results: BUG?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 17 Dec 1998 09:28:55 -0600 (CST)


According to Andriu Isenring Ritsch:
> I've noticed, that the links to the pages that don't seem to get indexed
> all start on one page called products.htm
>
> Now there is the problem that the page that links to products.htm has
> two links, one to products.htm and one to PRODUCTS.HTM.
>
> Because it's a Unix server and the page name is really products.htm,
> PRODUCTS.HTM gives a page not found error.
>
> Is it now possible that htdig removes all pages indexed starting from
> products.htm, because PRODUCTS.HTM was not found?
>
> It seems to me like that...
>
> What workaround is there? Unfortunately most of the site is linked from
> products.htm

Is fixing the defective link not an option? If not, how about making
sure htdig sees the good one before the bad one, somehow?

As it stands now, htdig keeps track of visited URLs by mapping them to
lower-case. This is valid for case-insensitive servers, but can be a
problem with case-sensitive ones - when they're not set up properly!
If you're careful to set up your links properly, and you don't use the
same name twice (one lower- and one upper-case) for different documents,
it shouldn't be a problem.
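
To make the effect concrete, here is a minimal sketch of that kind of
bookkeeping. It is not the actual Retriever.cc code: the names
lowercase_url and mark_visited are hypothetical, and std::set stands in
for whatever htdig really uses for its "visited" object.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper: map a whole URL to lower-case, which is the
// behaviour described above for the "visited" lookup.
std::string lowercase_url(std::string url) {
    std::transform(url.begin(), url.end(), url.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return url;
}

// Hypothetical stand-in for the "visited" check: records the lower-cased
// URL and returns true only if it had not been recorded before.
bool mark_visited(std::set<std::string>& visited, const std::string& url) {
    return visited.insert(lowercase_url(url)).second;
}
```

With this scheme, calling mark_visited() on .../PRODUCTS.HTM first and
then on .../products.htm returns true and then false, so whichever
spelling is seen first decides whether the document is fetched under a
name that exists.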

The only other "fix" would be to edit htdig/Retriever.cc, and find all
instances where the "visited" object is used. These are preceded by
a url.lowercase() or temp.lowercase() call, to map the URL to lower-case.
You'd need to remove these, or replace them with calls to a function
that would only map the first part of the URL (http://host.name.dom/)
to lower-case, and leave the rest of the path as mixed case. Removing
the lowercase() calls altogether would mean you'd have to be consistent
in the case used in the hostname part of the URLs - probably not a safe
assumption given the fact that your site isn't even consistent in the
case expected for the document names. Mapping only the first part of the
URL and leaving the path as mixed case would solve your problem, but
could pose a problem if you index any case-insensitive servers, where
differently-cased links to one document would be treated as separate URLs.
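
A replacement function along those lines might look roughly like the
sketch below. The name lowercase_host_only is made up for illustration,
and it works on std::string rather than htdig's own string and URL
classes, so it would need adapting before it could be dropped into
Retriever.cc. It also assumes the path starts at the first "/" after
"://", which ignores less common URL forms.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lower-case only the "scheme://host.name.dom" part
// of a URL and leave the path exactly as written.
std::string lowercase_host_only(const std::string& url) {
    std::string result = url;

    // Without a "scheme://" prefix we cannot tell where the host ends,
    // so the string is returned unchanged.
    std::string::size_type scheme_end = result.find("://");
    if (scheme_end == std::string::npos)
        return result;

    // The path is assumed to begin at the first '/' after the host;
    // if there is none, the whole URL is scheme plus host.
    std::string::size_type path_start = result.find('/', scheme_end + 3);
    if (path_start == std::string::npos)
        path_start = result.size();

    for (std::string::size_type i = 0; i < path_start; ++i)
        result[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(result[i])));

    return result;
}
```

Given "HTTP://Host.Name.Dom/PRODUCTS.HTM", this returns
"http://host.name.dom/PRODUCTS.HTM", so products.htm and PRODUCTS.HTM
would be tracked as two different URLs.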

So, to answer the question you pose in the subject line, it's not a bug,
it's a feature! :-)

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930