Re: htdig: Pages get indexed, but no results: BUG?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 17 Dec 1998 15:57:30 -0600 (CST)


According to Rodger Zeisler:
> htdig has a case_sensitive option that would make PRODUCT.HTM and
> product.htm appear the same. Since you have 3 character extensions (.htm),
> I am assuming that you are on NT not Unix (.html). NT (and MS Windows) is
> case insensitive.

The case_sensitive option currently affects only parsing of disallow
statements in the robots.txt file, and not how htdig keeps track of
visited documents. However, if you want to change how htdig keeps
track of visits, as I suggested earlier, it would be wise to make that
conditional on this option. Thanks for pointing it out.
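
A minimal sketch of what "conditional on this option" could look like,
written in Python for brevity. The visited_key helper and its arguments
are hypothetical names for illustration, not anything taken from the
htdig source:

```python
def visited_key(url: str, case_sensitive: bool = True) -> str:
    """Return the key under which a URL is remembered as visited.

    Hypothetical helper: fold case only when the server is declared
    case insensitive, so that distinct files on a case-sensitive
    server are never collapsed into a single visit record.
    """
    return url if case_sensitive else url.lower()


# Case-sensitive server: the two spellings stay separate documents.
strict = {visited_key(u, True) for u in ("/PRODUCTS.HTM", "/products.htm")}

# Case-insensitive server: both spellings map to one visit record.
folded = {visited_key(u, False) for u in ("/PRODUCTS.HTM", "/products.htm")}
```

With this kind of key, a bad upper-case href on a case-sensitive server
could no longer mask the good lower-case one.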

The .htm extensions have, unfortunately, polluted a great many Unix
servers over the past few years, as many web developers use M$ systems,
and stick to that ugly 3 character extension limit they've carried over
from DOS, even though Win95 & NT no longer impose that limit. So, it's
not a safe assumption that a server that has .htm files is not Unix-based.
If Andriu claims it's a Unix server, I'll take his word for it. In any
case, the problem stems from the fact that the developers assumed the
server was case insensitive, when in fact it's case sensitive, and
therefore not an NT server. That's why the href to PRODUCTS.HTM fails.
It should be lower-case, and the server cares which case is used.

According to Andriu Isenring Ritsch:
> I just wondered, when PRODUCTS.HTM returns 404 and products.htm returns OK and has
> a lot of links on it to other pages etc., why does htdig assume that the link is
> not ok and deletes all pages that have been retrieved starting from the
> products.htm page? I mean, htdig got the pages and deletes them again - they
> definitely must exist, so why delete them?

Who says it's deleting anything? Does an htdig -vvv seem to suggest that?

What I'm suggesting is that htdig sees the href to PRODUCTS.HTM before any
href to products.htm, and so it queues up the upper-case URL, but marks
the lower-case URL as visited (because all visits are recorded in lower
case). So, it tries to get PRODUCTS.HTM, and fails, so it never sees the
real file. Whenever it sees any of the good hrefs to products.htm, it
thinks the file was already visited, so it doesn't queue it up again.
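
To make that theory concrete, here is a toy simulation in Python. It is
only a sketch of the behaviour I'm describing, not htdig's actual code:
the SITE table, the crawl function, and the fold_case flag are all made
up for illustration, and the "server" is just an exact-case dictionary
lookup.

```python
from collections import deque

# Toy case-sensitive server: only the lower-case path exists.
# All paths and links here are invented for illustration.
SITE = {
    "/": ["/PRODUCTS.HTM", "/about.htm", "/products.htm"],
    "/about.htm": ["/products.htm"],
    "/products.htm": ["/widget.htm"],
    "/widget.htm": [],
}


def crawl(start_urls, fold_case=True):
    """Breadth-first crawl; returns (fetched, missing) lists of URLs."""
    key = (lambda u: u.lower()) if fold_case else (lambda u: u)
    visited = set()
    queue = deque()
    fetched, missing = [], []

    def enqueue(url):
        # The visit is recorded under the (possibly folded) key, but the
        # URL is queued exactly as it was written in the href.
        if key(url) not in visited:
            visited.add(key(url))
            queue.append(url)

    for url in start_urls:
        enqueue(url)
    while queue:
        url = queue.popleft()
        links = SITE.get(url)  # exact-case lookup, like a Unix server
        if links is None:
            missing.append(url)  # stands in for a 404
            continue
        fetched.append(url)
        for link in links:
            enqueue(link)
    return fetched, missing


bad_first = crawl(["/"])                     # bogus href seen first
real_first = crawl(["/products.htm", "/"])   # real file seen first
```

In the first run the bogus URL is queued and fails, and products.htm
(and everything linked only from it) is never fetched. In the second
run the real file is seen first, so the bogus href is never tried.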

Do you have any hard evidence that htdig is indeed fetching products.htm
from the server, and deleting its hrefs?

> One could also do it the other way around: htdig could know that one link was not
> ok, but since it was able to follow the page at some point, there must be a valid
> page with that name, so the links from that page must also be valid etc. (htdig
> could assume that there was a uppercase/lowercase problem).
>
> What do you think about that?

The way I read the code, I can't see how htdig would have tried to dig
both products.htm and PRODUCTS.HTM in the same run. If it sees the
real file first, it shouldn't try the bogus one at all, and if it sees
the bogus one first, it seems it would ignore the real one, because it
thinks it's the same file.

> (I know, fixing the link should be easier, but still...)

Personally, I find it ridiculous that they want you to put all this work
into setting up and demoing a search engine, and they can't be bothered
to find one person to take 30 seconds to fix one bad URL in one document.
But that's life, I guess.

How's this for a workaround: put the products.htm file as the first
URL in your start_url option, to make sure htdig sees it before the
bogus href. If you do so, and your limit_urls_to option refers to
${start_url}, as it does by default, you may want to explicitly specify
the limit_urls_to option. E.g.:

start_url: http://silly.server.com/some/path/to/products.htm \
        http://silly.server.com/
limit_urls_to: http://silly.server.com/

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


