Re: htdig: geocities robots.txt


Ryan Scott (rscott@netcreations.com)
Mon, 18 Jan 1999 14:41:11 -0500


This doesn't really work for me because my max doc size is about 5000 bytes. I'm
indexing a LOT of pages with a lot of .confs and don't want to fetch that much of each
page. I just had to replace a 9 gig drive with an 18 gig drive because I was running out
of space all the time.

Shouldn't robots.txt just be handled as a special case and ignore the max doc size? It
should grab the whole file.
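
A minimal sketch of that special case, assuming a hypothetical helper that picks the
byte limit per request. Neither fetchLimit nor kNoLimit exists in htdig; they are
made-up names for illustration only:

```cpp
#include <cstddef>
#include <limits>
#include <string>

// Sentinel meaning "read the whole file". Hypothetical, for this sketch only.
const std::size_t kNoLimit = std::numeric_limits<std::size_t>::max();

// Hypothetical helper, not htdig code: choose how many bytes to fetch.
// robots.txt is exempt from the cap; every other path keeps maxDocSize.
std::size_t fetchLimit(const std::string& path, std::size_t maxDocSize)
{
    if (path == "/robots.txt")
        return kNoLimit;    // special case: always read the full file
    return maxDocSize;      // ordinary documents stay capped
}
```

With a max doc size of 5000, fetchLimit("/robots.txt", 5000) yields kNoLimit while
any other path still yields 5000, so the small per-page cap is unaffected.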

> According to Ryan Scott:
> > Ok, I've been working with this, and it seems that the entire robots.txt
> > that geocities uses is not processed. Their file is very large, btw.
> >
> > It basically stops right after architext, having not found itself, and
> > then htdig refuses to bring any geocities pages back. Is there a
> > filesize limit with robots.txt?
> >
> > Here's much of the output, sorry for the length:
> >
> >
> > New server: www.geocities.com, 80
> > Retrieval command for http://www.geocities.com/robots.txt:
> > GET /robots.txt HTTP/1.0
> > User-Agent: htdig/3.1.0b1 (rscott@netcreations.com)
> > Host: www.geocities.com
> >
> > Header line: HTTP/1.1 200 OK
> > Header line: Date: Fri, 15 Jan 1999 22:52:02 GMT
> > Header line: Server: Apache/1.2.6
> > Header line: Last-Modified: Thu, 14 Jan 1999 22:05:26 GMT
> > Translated Thu, 14 Jan 1999 22:05:26 GMT to Thu, 14 Jan 1999 22:05:26 (99)
> > And converted to Thu, 14 Jan 1999 22:05:26
> > Header line: ETag: "c6f2-6058-369e6a26"
> > Header line: Content-Length: 24664
> > Header line: Accept-Ranges: bytes
> > Header line: Connection: close
> > Header line: Content-Type: text/plain
> > Header line:
> > returnStatus = 0
> > Read 8192 from document
> > Read 8192 from document
> > Read a total of 8192 bytes
> > (I think this is the problem! their server says it is 24664 bytes, not 8192)
> > Parsing robots.txt file using myname = htdig
> [snip]
> > It looks to me like you are asking for up to 8 K of robots.txt, which
> > isn't enough for this whole file.
> >
> > So how can we work around this or fix it?
>
> There are actually two separate problems here. First of all,
> htdig/Server.cc has a hard-coded size limit of 10000 bytes for the
> robots.txt file, which should be changed. Setting it to 0 will make
> the Document constructor use the "max_doc_size" attribute, which
> puts this limit under user control. Alternatively, you could replace the
> 10000 with whatever hard-coded limit you want, or introduce a new
> configuration attribute, e.g. max_robotstxt_size, and replace the 10000
> with config.Value("max_robotstxt_size") instead. Personally, I think the
> patch below is adequate, as max_doc_size is almost always going to be
> generous enough to handle the robots.txt file.
>
> --- ./htdig/Server.cc.robots Thu Dec 10 20:54:07 1998
> +++ ./htdig/Server.cc Mon Jan 18 12:35:09 1999
> @@ -64,7 +64,7 @@
>      //
>      String url = "http://";
>      url << host << ':' << port << "/robots.txt";
> -    Document doc(url, 10000);
> +    Document doc(url, 0);
>      switch (doc.RetrieveHTTP(0))
>      {
>          case Document::Document_ok:
>
> Secondly, there's a bug in RetrieveHTTP() and RetrieveLocal(), in how they
> deal with files that are over the size limit. These functions read in the
> file 8K at a time, and if appending the most recent 8K chunk would take
> the string over the size limit, the whole chunk is tossed out instead of
> being truncated to fit the remaining space. That is why your log shows
> only 8192 bytes read against the 10000-byte limit. This patch will solve
> that problem.
>
> --- ./htdig/Document.cc.robots Wed Jan 13 15:20:50 1999
> +++ ./htdig/Document.cc Mon Jan 18 12:38:27 1999
> @@ -511,8 +511,10 @@
>          if (debug > 2)
>              cout << "Read " << bytesRead << " from document\n";
>          if (contents.length() + bytesRead > max_doc_size)
> -            break;
> +            bytesRead = max_doc_size - contents.length();
>          contents.append(docBuffer, bytesRead);
> +        if (contents.length() >= max_doc_size)
> +            break;
>      }
>      c.close();
>      document_length = contents.length();
> @@ -657,8 +659,10 @@
>          if (debug > 2)
>              cout << "Read " << bytesRead << " from document\n";
>          if (contents.length() + bytesRead > max_doc_size)
> -            break;
> +            bytesRead = max_doc_size - contents.length();
>          contents.append(docBuffer, bytesRead);
> +        if (contents.length() >= max_doc_size)
> +            break;
>      }
>      fclose(f);
>      document_length = contents.length();
>
> Both patches were to the htdig-3.1.0b5dev-011299 source, but should be
> applicable to the 3.1.0b4 source as well.
>
> --
> Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
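
The difference between dropping and truncating the last chunk, as described for the
second patch above, can be tried outside htdig with a small self-contained sketch.
readWithLimit and its parameters are hypothetical names, and the input is an
in-memory string standing in for the socket or file; this is not the actual
Document.cc code:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the read loops in RetrieveHTTP()/RetrieveLocal().
// Copies `source` into a buffer `chunkSize` bytes at a time, up to `limit`.
// truncate == false mimics the old behavior: a chunk that would overflow the
// limit is dropped entirely. truncate == true mimics the patched behavior:
// the chunk is cut down so the result is exactly `limit` bytes long.
std::string readWithLimit(const std::string& source, std::size_t limit,
                          std::size_t chunkSize, bool truncate)
{
    std::string contents;
    std::size_t offset = 0;
    while (offset < source.size()) {
        std::size_t bytesRead = std::min(chunkSize, source.size() - offset);
        if (contents.size() + bytesRead > limit) {
            if (!truncate)
                break;                            // old: discard the whole chunk
            bytesRead = limit - contents.size();  // patched: keep what still fits
        }
        contents.append(source, offset, bytesRead);
        offset += bytesRead;
        if (contents.size() >= limit)
            break;                                // limit reached, stop reading
    }
    return contents;
}
```

Fed a 24664-byte input (the Content-Length from the log) with 8192-byte chunks and
a 10000-byte limit, the old variant keeps 8192 bytes, matching the "Read a total of
8192 bytes" line, while the truncating variant keeps the full 10000.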

______________________________________________________________________
Ryan Scott - rscott@netcreations.com - 212 625 1370
PostMaster Direct Response - Targeted 100% OPT IN Email
http://www.postmasterdirect.com/
