Re: [htdig] "skipped"


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Fri, 29 Oct 1999 13:07:42 -0500 (CDT)


According to Kari Suomela:
> This happens only on *my* own site. The others work ok. What is causing
> it to be 'skipped'? There is no "robots.txt". What is the reason for
> looking for one?

The robots.txt file tells search engines which parts of a site they are
not allowed to index. If htdig doesn't find one, it assumes the whole site
is fair game.

See http://www.archive.org/robotexclusion.html for details, or
http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html for
LOTS of details.
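
As a rough illustration of that rule (this is not htdig's own code), Python's
standard urllib.robotparser behaves the same way: with no rules at all,
every URL is allowed. The "Disallow: /private/" rule and the notes.html path
below are made up for the example; an empty list stands in for the 404 on
/robots.txt.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, url, agent="htdig"):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    # An empty list stands in for a site with no robots.txt (a 404 response).
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# No robots.txt at all: nothing is excluded.
no_rules = []
# Hypothetical robots.txt that fences off one directory for all robots.
some_rules = ["User-agent: *", "Disallow: /private/"]

print(allowed(no_rules, "http://www.karicobs.com/"))
print(allowed(some_rules, "http://www.karicobs.com/private/notes.html"))
```

So a missing robots.txt is harmless; the lookup is just how a robot finds
out whether it has been asked to stay away from anything.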

As for the "skipped" message, htdig prints that when it encounters a URL
that it has already visited, or that is already on its list of pages to get.

> === Cut ===
> 0:0:http://www.karicobs.com/

OK, this is the first time it encounters your URL. That's what prompts
it to set up a new server object, and look up the robots.txt file.

> New server: www.karicobs.com, 80
> Retrieval command for http://www.karicobs.com/robots.txt: GET
> /robots.txt HTTP/1.0
> User-Agent: htdig/3.1.3 (webmaster@karicobs.com)
> Host: www.karicobs.com
>
> Header line: HTTP/1.1 404 Not Found
> Header line: Date: Fri, 29 Oct 1999 15:08:25 GMT
> Header line: Server: Apache/1.3.9 (Unix)
> Header line: Connection: close
> Header line: Content-Type: text/html
> Header line:
> returnStatus = 1
> pushed
> 1:0:http://www.karicobs.com/ skipped

OK, this is the second time it encounters the same URL, so obviously
it's not going to add it to its list of pages to index, hence the
"skipped" message. Not a problem, it's already on the list. Most likely,
it encounters the URL twice because it's in both the start_url list and
already in the database. It builds up its whole list from the contents
of both of these.
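
The idea can be sketched in a few lines of Python. This is only a guess at
the general shape, not htdig's actual implementation; build_queue and its
two arguments are names invented for the example.

```python
from collections import deque


def build_queue(start_urls, database_urls):
    """Merge both URL sources into one fetch queue, skipping repeats."""
    seen = set()
    queue = deque()
    log = []
    for url in list(start_urls) + list(database_urls):
        if url in seen:
            # Already queued or visited: nothing to do, just report it.
            log.append(f"{url} skipped")
        else:
            seen.add(url)
            queue.append(url)
            log.append(f"{url} pushed")
    return queue, log


# The same URL arrives once from start_url and once from the old database.
queue, log = build_queue(
    ["http://www.karicobs.com/"],
    ["http://www.karicobs.com/"],
)
print("\n".join(log))
```

The second occurrence is reported as skipped, but the URL is still on the
queue exactly once, so it does get fetched.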

> pick: www.karicobs.com, # servers = 1
> 0:1:255:http://www.karicobs.com/: Retrieval command for
> http://www.karicobs.com/: GET / HTTP/1.0
> User-Agent: htdig/3.1.3 (webmaster@karicobs.com)
> If-Modified-Since: Fri, 29 Oct 1999 15:07:04 GMT
> Host: www.karicobs.com
>

Here's where it issues the GET command for the home page of this site.
The server's response is below.

> Header line: HTTP/1.1 304 Not Modified
> Header line: Date: Fri, 29 Oct 1999 15:08:25 GMT
> Header line: Server: Apache/1.3.9 (Unix)
> Header line: Connection: close
> Header line: ETag: "b20a0-197a-3819b818"
> Header line:
> returnStatus = 2
> not changed
> pick: www.karicobs.com, # servers = 1

OK, so the page is already indexed, and hasn't been changed since the last
time it was indexed. What's the problem? Did the page actually change,
and the HTTP server incorrectly reported it as "Not Modified", or was
it correctly indexed before? If there was a problem with the indexing
before, you'll have to get htdig to reindex from scratch (-i option)
so that you can get a verbose listing of its parsing of this document.
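
For reference, the server's side of that If-Modified-Since exchange boils
down to a date comparison, roughly like the Python sketch below. The
conditional_status function and the two Last-Modified values are
hypothetical; only the If-Modified-Since date comes from the log above.

```python
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the page is no newer than the client's copy, else 200."""
    client_copy = parsedate_to_datetime(if_modified_since)
    server_copy = parsedate_to_datetime(last_modified)
    return 304 if server_copy <= client_copy else 200


# Date htdig sent in its If-Modified-Since header (from the log).
since = "Fri, 29 Oct 1999 15:07:04 GMT"
# Hypothetical Last-Modified values for the page on the server.
unchanged = "Fri, 29 Oct 1999 14:00:00 GMT"
changed = "Fri, 29 Oct 1999 15:30:00 GMT"

print(conditional_status(since, unchanged))
print(conditional_status(since, changed))
```

If the page really was edited after the date htdig sent and the server still
answers 304, the problem is on the server side rather than in htdig.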

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2.0b3 on Fri Oct 29 1999 - 11:17:13 PDT