Re: htdig: geocities robots.txt


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Mon, 18 Jan 1999 14:23:30 -0600 (CST)


According to Ryan Scott:
> This doesn't really work for me because my max doc size is about 5000 bytes. I'm
> indexing a LOT of pages with a lot of .confs and don't want to fetch that much of them. I
> just had to replace a 9 gig drive with an 18 gig drive because I was running out of space
> all the time.

Well, in that case, you should use one of the two alternative
solutions I proposed. Either replace the hard-coded 10000 byte
limit in Server.cc with a much bigger limit, like 100000, or use
config.Value("max_robotstxt_size") instead, and then define that
attribute to be whatever you want in your htdig.conf file.
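
A minimal sketch of the second option follows. It is not the actual
Server.cc code: the Config class is a stand-in written so the example
compiles on its own, and robotsTxtLimit and kDefaultRobotsLimit are
hypothetical names. It only shows the idea of reading the limit from the
configuration and falling back to a generous default when the attribute
is missing or invalid.

```cpp
#include <map>
#include <string>

// Stand-in for the htdig configuration object (hypothetical, illustration only).
class Config {
public:
    void Add(const std::string& name, const std::string& value) {
        attrs_[name] = value;
    }

    // Return the attribute as an integer, or `fallback` if it is
    // undefined or does not parse as a number.
    int Value(const std::string& name, int fallback = 0) const {
        auto it = attrs_.find(name);
        if (it == attrs_.end()) return fallback;
        try {
            return std::stoi(it->second);
        } catch (...) {
            return fallback;
        }
    }

private:
    std::map<std::string, std::string> attrs_;
};

// Assumed default, matching the "much bigger limit" suggested above.
const int kDefaultRobotsLimit = 100000;

// Size limit to apply when fetching robots.txt.
int robotsTxtLimit(const Config& config) {
    int limit = config.Value("max_robotstxt_size", kDefaultRobotsLimit);
    // Treat zero or negative values as "not set".
    return limit > 0 ? limit : kDefaultRobotsLimit;
}
```

With a change along these lines, a line such as
"max_robotstxt_size: 250000" in htdig.conf would raise the limit without
recompiling. The attribute has no effect unless the code is patched to
read it.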

In any case, you should still apply my second patch to Document.cc.

> Shouldn't robots.txt just be handled as a special case and ignore the max doc size?

It is already handled as a special case: the size limit is
hard-coded rather than taken from max_doc_size. I suggested using
max_doc_size because the default for that is much more generous than
the hard-coded limit for the robots.txt file. I didn't anticipate
that anyone would want to set max_doc_size so low! Note that there is
an important difference between max_doc_size and max_head_length: the
former restricts the size of file that will be read from the server,
while the latter restricts the amount of document head that will be
stored in the database for excerpts. If your concern is the database
size, you should be limiting the max_head_length. Limiting max_doc_size
won't save database disk usage (unless you set it to something smaller
than max_head_length, which you shouldn't), but it will speed up the
dig by aborting the transfer of files larger than the limit you set.
It will also mean that these large documents will not be fully indexed,
and any links they contain beyond the limit you set will not be followed.
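
The difference between the two limits can be sketched as below. This is
a simplified model, not htdig's code: DigResult and digDocument are
hypothetical names, and the real excerpt handling does more than take a
prefix of the raw document. The point is only that max_doc_size caps
what is read and indexed, while max_head_length caps what is stored.

```cpp
#include <cstddef>
#include <string>

// Hypothetical summary of what happens to one fetched document.
struct DigResult {
    std::string indexed;  // portion read from the server and indexed
    std::string excerpt;  // portion stored in the database for excerpts
    bool truncated;       // true if the document exceeded maxDocSize
};

DigResult digDocument(const std::string& body,
                      std::size_t maxDocSize,
                      std::size_t maxHeadLength) {
    DigResult r;
    r.truncated = body.size() > maxDocSize;
    // Anything past maxDocSize is never read, so words and links
    // beyond that point are not seen at all.
    r.indexed = body.substr(0, maxDocSize);
    // Only the first maxHeadLength bytes of what was read are stored.
    r.excerpt = r.indexed.substr(0, maxHeadLength);
    return r;
}
```

In this model, lowering maxDocSize shrinks r.indexed but leaves
r.excerpt unchanged as long as maxDocSize stays above maxHeadLength, so
the stored excerpt size depends on maxHeadLength alone.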

> It should grab the whole file.

For the robots.txt file, I agree that it should grab the whole file in
all cases. There are two alternatives for this: a) completely re-write
the code that fetches this file, to ignore any file size limit currently
imposed by the RetrieveHTTP() function, or b) just set a nice, generous
limit. The second choice is easier by far, so that's my recommendation.
If you don't want the limit to be max_doc_size, then use max_robotstxt_size
as I described above, or hard-code a bigger limit.
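
To see why a truncated robots.txt matters, here is a small sketch that
collects Disallow paths from only the first `limit` bytes of a file. It
is an illustration under simplifying assumptions, not htdig's parser:
disallowedPaths is a hypothetical name, User-agent sections are ignored,
and the "Disallow:" match is case-sensitive.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Collect Disallow paths from the first `limit` bytes of a robots.txt body.
std::vector<std::string> disallowedPaths(const std::string& robotsTxt,
                                         std::size_t limit) {
    std::istringstream in(robotsTxt.substr(0, limit));
    std::vector<std::string> paths;
    const std::string key = "Disallow:";
    const char* blanks = " \t\r";
    std::string line;
    while (std::getline(in, line)) {
        // Skip lines that do not start with "Disallow:".
        if (line.compare(0, key.size(), key) != 0) continue;
        std::string path = line.substr(key.size());
        std::size_t start = path.find_first_not_of(blanks);
        if (start == std::string::npos) continue;  // empty Disallow line
        std::size_t end = path.find_last_not_of(blanks);
        paths.push_back(path.substr(start, end - start + 1));
    }
    return paths;
}
```

With a limit smaller than the file, rules past the cutoff are silently
dropped, and a rule cut in mid-line comes back as a shorter path than
the site owner wrote. A generous limit avoids both problems for any
reasonably sized file.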

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930