Re: [htdig] Index of More than 2 Million html files


Geoff Hutchison (ghutchis@wso.williams.edu)
Fri, 14 May 1999 22:11:50 -0400


>it htdig at the index file but it seems to not read the complete index file,
>(index file is extremely large, about 63 MB or 2.3 million lines) So what I

For performance and memory reasons, ht://Dig limits the size of the
documents it reads via the max_doc_size attribute (you'll need to set this
*much* larger): http://www.htdig.org/attrs.html#max_doc_size
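For instance, in the configuration file you pass to htdig you could raise
the limit well past your 63 MB index file. This is only a sketch (the
100 MB value is an arbitrary assumption; pick whatever comfortably covers
your largest file):

  # allow documents of up to roughly 100 MB to be read in full
  max_doc_size:  100000000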

I think you'll get much better performance by splitting your dig up into
smaller digs, for example several index pages with robots META tags set to
noindex, each linking to maybe 5,000 URLs (see the sketch below). Even then,
the problem is that to index it, ht://Dig will need to store the list of
2+ million URLs before it even hits the first document you want indexed.
The thing is, spiders (ht://Dig included) are optimized for indexing a
network of pages. So in your situation, rather than indexing one file and
finding a few links, it receives a huge number of URLs up front that it has
to hold in memory.
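Just as a sketch of what one of those smaller index pages could look like
(the file names and URLs here are made up; the robots META tag itself is
standard, and ht://Dig does honor it):

  <html>
  <head>
  <!-- don't index this list page itself, but do follow its links -->
  <meta name="robots" content="noindex">
  <title>URL list, part 1</title>
  </head>
  <body>
  <a href="http://yourserver/docs/doc000001.html">doc000001</a>
  <a href="http://yourserver/docs/doc000002.html">doc000002</a>
  <!-- ... and so on, up to about 5,000 links per page ... -->
  </body>
  </html>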

One way you can get around this is to run several separate digs, each over
a smaller set of URLs, and then merge the databases using htmerge.
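A rough sketch of that workflow, assuming one config file per subset (the
file names are made up, and I believe htmerge's -m option merges a second
set of databases into the first; check the htmerge documentation for your
version):

  # initial dig of each subset, each with its own start_url/database_dir
  htdig -i -c /etc/htdig/part1.conf
  htdig -i -c /etc/htdig/part2.conf

  # merge part2's databases into part1's
  htmerge -c /etc/htdig/part1.conf -m /etc/htdig/part2.conf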

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



