Re: htdig: Pages get indexed, but no results: BUG?


Andriu Isenring Ritsch (webmaster@netsolution.ch)
Thu, 17 Dec 1998 17:57:36 +0100


Gilles Detillieux wrote:

> According to Andriu Isenring Ritsch:
> > I've noticed that the links to the pages that don't seem to get indexed
> > all start on one page called products.htm
> >
> > Now there is the problem that the page that links to products.htm has
> > two links, one to products.htm and one to PRODUCTS.HTM.
> >
> > Because it's a Unix server and the page name is really products.htm,
> > PRODUCTS.HTM gives a page not found error.
> >
> > Is it possible that htdig removes all pages indexed starting from
> > products.htm, because PRODUCTS.HTM was not found?
> >
> > It seems that way to me...
> >
> > What workaround is there? Unfortunately most of the site is linked from
> > products.htm
>
> Is fixing the defective link not an option? If not, how about making
> sure htdig sees the good one before the bad one, somehow?
>
> As it stands now, htdig keeps track of visited URLs by mapping them to
> lower-case. This is valid for case-insensitive servers, but can be a
> problem with case-sensitive ones - when they're not set up properly!
> If you're careful to set up your links properly, and you don't use the
> same name twice (one lower- and one upper-case) for different documents,
> it shouldn't be a problem.
>
> The only other "fix" would be to edit htdig/Retriever.cc, and find all
> instances where the "visited" object is used. These are preceded by
> a url.lowercase() or temp.lowercase(), to map the URL to lower-case.
> You'd need to remove these, or replace them with calls to a function
> that would only map the first part of the URL (http://host.name.dom/)
> to lower-case, and leave the rest of the path as mixed case. Removing
> the lowercase() calls altogether would mean you'd have to be consistent
> in the case used in the hostname part of the URLs - probably not a safe
> assumption given the fact that your site isn't even consistent in the
> case expected for the document names. Mapping the first part of the
> URL, but leaving the path as mixed case would solve your problem, but
> could pose a problem if you index any case-insensitive servers.
>
> So, to answer the question you pose in the subject line, it's not a bug,
> it's a feature! :-)
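
For reference, the host-only mapping described above could look roughly like
the sketch below. The function name lowercase_host_part is made up for
illustration and is not part of htdig; it works on a std::string rather than
htdig's own URL class, and it assumes a simple scheme://host/path layout
(no user:password@ part, no '?' before the first '/').

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not an htdig function): lower-case only the
// "scheme://host" part of a URL and leave the path in its original case.
std::string lowercase_host_part(const std::string &url)
{
    std::string result = url;

    // Without "://" there is no host part to map, so return the URL as-is.
    std::string::size_type scheme_end = result.find("://");
    if (scheme_end == std::string::npos)
        return result;

    // The host ends at the first '/' after "://", or at the end of the URL.
    std::string::size_type host_end = result.find('/', scheme_end + 3);
    if (host_end == std::string::npos)
        host_end = result.size();

    for (std::string::size_type i = 0; i < host_end; ++i)
        result[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(result[i])));

    return result;
}
```

With this mapping, http://Host.Name.DOM/products.htm and
http://host.name.dom/PRODUCTS.HTM stay distinct keys, so a 404 on one would
not be confused with the other. The trade-off is the one already mentioned:
on a case-insensitive server the same document could be visited twice.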

I just wondered: when PRODUCTS.HTM returns 404 and products.htm returns OK and has
a lot of links on it to other pages etc., why does htdig assume that the link is
not OK and delete all pages that have been retrieved starting from the
products.htm page? I mean, htdig got the pages and then deletes them again. They
definitely must exist, so why delete them?
One could also do it the other way around: htdig could know that one link was not
OK, but since it was able to fetch the page at some point, there must be a valid
page with that name, so the links from that page must also be valid etc. (htdig
could assume that there was an uppercase/lowercase problem).
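
The idea can be sketched as follows. This is only an illustration with
made-up names (VisitedLog, record, keep); it is not how htdig stores its
visited list, and it assumes that fetch results are keyed by the lower-cased
URL. A successful fetch is never overwritten by a later 404 for a case
variant of the same URL.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical bookkeeping, not htdig's actual data structures.
class VisitedLog {
public:
    // Record the outcome of fetching `url`. Once any case variant of the
    // URL has been fetched successfully, the entry stays marked as found.
    void record(const std::string &url, bool found)
    {
        std::string key = to_lower(url);
        std::map<std::string, bool>::iterator it = status_.find(key);
        if (it == status_.end())
            status_[key] = found;
        else if (found)
            it->second = true;
    }

    // True if some case variant of `url` was fetched successfully, i.e.
    // the pages reached through it should be kept.
    bool keep(const std::string &url) const
    {
        std::map<std::string, bool>::const_iterator it =
            status_.find(to_lower(url));
        return it != status_.end() && it->second;
    }

private:
    static std::string to_lower(std::string s)
    {
        for (std::string::size_type i = 0; i < s.size(); ++i)
            s[i] = static_cast<char>(
                std::tolower(static_cast<unsigned char>(s[i])));
        return s;
    }

    std::map<std::string, bool> status_;
};
```

Recording products.htm as found and PRODUCTS.HTM as not found, in either
order, leaves the entry marked as found. The cost is that two really
different documents whose names differ only in case would share one entry.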

What do you think about that?

(I know, fixing the link should be easier, but still...)

Andriu



