Re: htdig: geocities robots.txt


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Mon, 18 Jan 1999 13:34:26 -0600 (CST)


According to Ryan Scott:
> Ok, I've been working with this, and it seems that the entire robots.txt that
> geocities uses is not processed. Their file is very large, btw.
>
> It basically stops right after architext, having not found itself, and then
> htdig refuses to bring any geocities pages back. Is there a filesize limit with
> robots.txt?
>
> Here's much of the output, sorry for the length:
>
>
> New server: www.geocities.com, 80
> Retrieval command for http://www.geocities.com/robots.txt: GET /robots.txt
> HTTP/1.0
> User-Agent: htdig/3.1.0b1 (rscott@netcreations.com)
> Host: www.geocities.com
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Fri, 15 Jan 1999 22:52:02 GMT
> Header line: Server: Apache/1.2.6
> Header line: Last-Modified: Thu, 14 Jan 1999 22:05:26 GMT
> Translated Thu, 14 Jan 1999 22:05:26 GMT to Thu, 14 Jan 1999 22:05:26 (99)
> And converted to Thu, 14 Jan 1999 22:05:26
> Header line: ETag: "c6f2-6058-369e6a26"
> Header line: Content-Length: 24664
> Header line: Accept-Ranges: bytes
> Header line: Connection: close
> Header line: Content-Type: text/plain
> Header line:
> returnStatus = 0
> Read 8192 from document
> Read 8192 from document
> Read a total of 8192 bytes (I think this is the problem! their server says it
> is 24664 bytes, not 8192)
> Parsing robots.txt file using myname = htdig
[snip]
> It looks to me like you are asking for up to 8 K of robots.txt, which isn't
> enough for this whole file.
>
> So how can we work around this or fix it?

There are actually two separate problems here. First of all,
htdig/Server.cc has a hard-coded size limit of 10000 bytes for the
robots.txt file, which should be changed. Setting it to 0 will make
the Document constructor use the "max_doc_size" attribute, which
puts this limit under user control. Alternatively, you could replace
the 10000 with whatever hard-coded limit you want, or introduce a new
configuration attribute, e.g. max_robotstxt_size, and use
config.Value("max_robotstxt_size") in place of the 10000. Personally, I think
the patch below is adequate, as max_doc_size is almost always going to
be generous enough to handle the robots.txt file.

--- ./htdig/Server.cc.robots Thu Dec 10 20:54:07 1998
+++ ./htdig/Server.cc Mon Jan 18 12:35:09 1999
@@ -64,7 +64,7 @@
     //
     String url = "http://";
     url << host << ':' << port << "/robots.txt";
-    Document doc(url, 10000);
+    Document doc(url, 0);
     switch (doc.RetrieveHTTP(0))
     {
         case Document::Document_ok:
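
For anyone who prefers the separate-attribute alternative mentioned before the patch, here is a minimal standalone sketch of the lookup. FakeConfig and robots_size_limit are made-up names for illustration, not htdig code; only the config.Value("max_robotstxt_size") call and the max_doc_size attribute come from the discussion above, and treating an unset attribute as 0 is an assumption.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for htdig's configuration object. Only a
// Value() lookup is modelled, and an unset attribute is ASSUMED to
// yield 0 here.
struct FakeConfig {
    std::map<std::string, int> attrs;

    int Value(const std::string &name) const {
        std::map<std::string, int>::const_iterator it = attrs.find(name);
        return it == attrs.end() ? 0 : it->second;
    }
};

// Pick the size limit for robots.txt: use the dedicated attribute if it
// is set, otherwise fall back to max_doc_size (similar in spirit to
// passing 0 to the Document constructor, as described above).
int robots_size_limit(const FakeConfig &config) {
    int limit = config.Value("max_robotstxt_size");
    if (limit <= 0)
        limit = config.Value("max_doc_size");
    return limit;
}
```

With only max_doc_size set, robots_size_limit() returns that value; once max_robotstxt_size is set as well, the dedicated attribute wins.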

Secondly, there's a bug in RetrieveHTTP() and RetrieveLocal(), in how
they deal with files that are over the size limit. These functions read
in the file 8K at a time, and if appending the most recent 8K chunk would
take the string over the size limit, the whole chunk is tossed out instead
of being truncated to fit the limit. That is why your log shows a total of
only 8192 bytes: the second 8K chunk would have gone past the 10000-byte
limit, so it was dropped. This patch will solve that problem.

--- ./htdig/Document.cc.robots Wed Jan 13 15:20:50 1999
+++ ./htdig/Document.cc Mon Jan 18 12:38:27 1999
@@ -511,8 +511,10 @@
         if (debug > 2)
             cout << "Read " << bytesRead << " from document\n";
         if (contents.length() + bytesRead > max_doc_size)
-            break;
+            bytesRead = max_doc_size - contents.length();
         contents.append(docBuffer, bytesRead);
+        if (contents.length() >= max_doc_size)
+            break;
     }
     c.close();
     document_length = contents.length();
@@ -657,8 +659,10 @@
         if (debug > 2)
             cout << "Read " << bytesRead << " from document\n";
         if (contents.length() + bytesRead > max_doc_size)
-            break;
+            bytesRead = max_doc_size - contents.length();
         contents.append(docBuffer, bytesRead);
+        if (contents.length() >= max_doc_size)
+            break;
     }
     fclose(f);
     document_length = contents.length();
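
To see the difference between the old and patched loops without building htdig, here is a small standalone model. It is a sketch, not the htdig code: read_discarding and read_truncating are hypothetical names, the in-memory string stands in for the server response, and every read is assumed to return a full chunk until the data runs out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Old behaviour: a chunk that would cross the limit is dropped whole.
std::string read_discarding(const std::string &source,
                            std::size_t chunk_size,
                            std::size_t max_doc_size) {
    assert(chunk_size > 0);
    std::string contents;
    for (std::size_t pos = 0; pos < source.size(); pos += chunk_size) {
        std::size_t bytesRead = std::min(chunk_size, source.size() - pos);
        if (contents.length() + bytesRead > max_doc_size)
            break;                      // the whole chunk is lost
        contents.append(source, pos, bytesRead);
    }
    return contents;
}

// Patched behaviour: the last chunk is trimmed to fit, then the loop stops.
std::string read_truncating(const std::string &source,
                            std::size_t chunk_size,
                            std::size_t max_doc_size) {
    assert(chunk_size > 0);
    std::string contents;
    for (std::size_t pos = 0; pos < source.size(); pos += chunk_size) {
        std::size_t bytesRead = std::min(chunk_size, source.size() - pos);
        if (contents.length() + bytesRead > max_doc_size)
            bytesRead = max_doc_size - contents.length();  // keep what fits
        contents.append(source, pos, bytesRead);
        if (contents.length() >= max_doc_size)
            break;
    }
    return contents;
}
```

Feeding both functions a 24664-byte string with 8192-byte chunks and a 10000-byte limit, read_discarding() keeps 8192 bytes (matching the log above), while read_truncating() keeps exactly 10000. With a limit larger than the file, both return all 24664 bytes.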

Both patches were made against the htdig-3.1.0b5dev-011299 source, but
should apply to the 3.1.0b4 source as well.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


