Re: htdig: geocities robots.txt


Ryan Scott (ryan@netcreations.com)
Fri, 15 Jan 1999 17:59:59 -0500


Geoff Hutchison wrote:

> At 4:54 PM -0500 12/14/98, Ryan Scott wrote:
> >I still cannot index geocities pages. Here's what they put in their
> >robots.txt file:
> >
> ># htdig knows where to go.
> >User-agent: htdig/3.1.0b1
>
> >I'm not too familiar with how it is all supposed to be but it appears this
> >doesn't cut it. I'm trying to index various neighborhoods on request of the
> >folks running those neighborhoods, in case you were a wonderin.
>
> Either get Geocities to set the User-agent: htdig or set the robotstxt_name
> option to "htdig/3.1.0b1" in your conf file. The problem is that the
> robots.txt name must match *exactly* and it doesn't.

Ok, I've been working with this, and it seems that htdig does not process the
entire robots.txt that geocities uses. Their file is very large, btw.

It basically stops right after the architext entry, having not found itself,
and then htdig refuses to bring any geocities pages back. Is there a filesize
limit with robots.txt?
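
For what it's worth, here is a small Python sketch of what I suspect is going
on. It is not htdig's code. The sample file is made up (one ArchitextSpider
record padded with 600 hypothetical Disallow lines so that the htdig record
starts past the 8192-byte mark), and agents_seen and build_sample are names I
invented for illustration.

```python
def agents_seen(robots_txt, limit=None):
    """Return the User-agent names found in the first `limit` bytes."""
    data = robots_txt if limit is None else robots_txt[:limit]
    agents = []
    for raw in data.decode("latin-1").splitlines():
        # Drop trailing comments, then split "field: value".
        line = raw.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "user-agent":
            agents.append(value.strip())
    return agents


def build_sample():
    """Build a made-up robots.txt whose htdig record lies past 8192 bytes."""
    padding = "".join("Disallow: /dir%04d/\n" % i for i in range(600))
    text = (
        "User-agent: ArchitextSpider\n"
        + padding
        + "\n# htdig knows where to go.\n"
        + "User-agent: htdig/3.1.0b1\n"
        + "Disallow: /admin/\n"
    )
    return text.encode("latin-1")


sample = build_sample()
print(len(sample))                 # total size of the made-up file
print(agents_seen(sample, 8192))   # what a reader capped at 8192 bytes sees
print(agents_seen(sample))         # what a reader of the whole file sees
```

With the 8192-byte cap only ArchitextSpider is seen. Without it, the
htdig/3.1.0b1 record shows up as well.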

Here's much of the output, sorry for the length:

New server: www.geocities.com, 80
Retrieval command for http://www.geocities.com/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.0b1 (rscott@netcreations.com)
Host: www.geocities.com

Header line: HTTP/1.1 200 OK
Header line: Date: Fri, 15 Jan 1999 22:52:02 GMT
Header line: Server: Apache/1.2.6
Header line: Last-Modified: Thu, 14 Jan 1999 22:05:26 GMT
Translated Thu, 14 Jan 1999 22:05:26 GMT to Thu, 14 Jan 1999 22:05:26 (99)
And converted to Thu, 14 Jan 1999 22:05:26
Header line: ETag: "c6f2-6058-369e6a26"
Header line: Content-Length: 24664
Header line: Accept-Ranges: bytes
Header line: Connection: close
Header line: Content-Type: text/plain
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read a total of 8192 bytes (I think this is the problem! their server says it
is 24664 bytes, not 8192)
Parsing robots.txt file using myname = htdig

<snipped out the part where it went over the first few entries>

Robots.txt line: # Excite (formerly known as Architext) knows where to go.
Robots.txt line: User-agent: ArchitextSpider
Found 'user-agent' line: ArchitextSpider
Robots.txt line: Disallow: /admin/ # all paths except neighborhoods and members section are disallowed
Robots.txt line: Disallow: /auditor/
Robots.txt line: Disallow: /cgi_emails/
Robots.txt line: Disallow: /cgi_html/
Robots.txt line: Disallow: /cgi-bin/
Robots.txt line: Disallow: /chat/
Robots.txt line: Disallow: /classes/
Robots.txt line: Disallow: /companies/
Robots.txt line: Disallow: /dbm_files/
Robots.txt line: Disallow: /demos/
Robots.txt line: Disallow: /error_messages/
Robots.txt line: Disallow: /errors/
Robots.txt line: Disallow: /features/
Robots.txt line: Disallow: /GeoPartners/
Robots.txt line: Disallow: /geoplus/
Robots.txt line: Disallow: /geoshops/
Robots.txt line: Disallow: /geostore/
Robots.txt line: Disallow: /geoworld/
Robots.txt line: Disallow:/GreetingCards/
Robots.txt line: Disallow: /guide/
Robots.txt line: Disallow: /homestead/
Robots.txt line: Disallow: /hoodpages/
Robots.txt line: Disallow: /htmlfrag/
Robots.txt line: Disallow: /images/
Robots.txt line: Disallow: /include/
Robots.txt line: Disallow: /index.html
Robots.txt line: Disallow: /java/
Robots.txt line: Disallow: /join/
Robots.txt line: Disallow: /LunarAw
Pattern: /
robots.txt: discarding 'http://www.geocities.com/~olelo', which = 0, length = 1
robots.txt: discarding 'http://www.geocities.com/~olelo/', which = 0, length = 1
pick: www.geocities.com:80, # servers = 1

It looks to me like you are asking for at most 8 KB (8192 bytes) of robots.txt,
which isn't enough for this whole file (24664 bytes).
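
As a rough sketch of the kind of fix I have in mind (Python, not htdig's actual
C++ code; read_once and read_body are hypothetical helpers, and io.BytesIO
stands in for the socket), the reader would keep going until it has the number
of bytes the Content-Length header promised instead of stopping after one
8192-byte buffer:

```python
import io


def read_once(stream, max_bytes=8192):
    """Read a single buffer and stop, which is what the log suggests happens."""
    return stream.read(max_bytes)


def read_body(stream, content_length, chunk_size=8192):
    """Read chunks until content_length bytes have arrived or the stream ends."""
    chunks = []
    remaining = content_length
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # the server closed the connection early
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Stand-in body with the same size as the Content-Length in the log above.
body = b"x" * 24664
print(len(read_once(io.BytesIO(body))))
print(len(read_body(io.BytesIO(body), 24664)))
```

The first call returns only 8192 bytes, the second returns all 24664.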

So how can we work around this or fix it?

Thanks for any info.

Ryan
