Sadhunathan Nadesan
Date: Sat Mar 11 2000

aloha, htdig friends:

due to the request of the users, i recently increased the size and quantity
of pages that could be indexed and the amount of context surrounding the
returned references. this is apparently causing htdig to fail while
sorting. my question is: does anyone know how i can control the disk space
used by sort? or does anyone have other suggestions?

more info below.

the htdig log file has this message:

htmerge: Sorting...
htmerge: Word sort failed

the indexing process that failed produced this system error message

/bin/sort: write error: No space left on device
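
a quick way to see which file system is actually full is to check the free space where sort keeps its temporary files. a minimal sketch, assuming sort uses $TMPDIR when it is set and otherwise falls back to /tmp (the usual GNU sort default); the variable names are only illustrative:

```shell
# directory where sort writes its temporary files: $TMPDIR if set, else /tmp
sort_tmp=${TMPDIR:-/tmp}

# -k reports 1024-byte blocks, -P forces the one-line-per-file-system POSIX layout
df -kP "$sort_tmp"

# pull out the "Available" column (4th field of the 2nd line) for a quick check
avail_kb=$(df -kP "$sort_tmp" | awk 'NR==2 {print $4}')
echo "free space for sort temp files in $sort_tmp: ${avail_kb} KB"
```

if the number reported is small compared with the size of the word list being sorted, the write error above is the expected result.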

now, checking the FAQ, under the limits of htdig, it says

Right now htmerge performs a sort
on the words indexed. Most sort programs use a fair amount of RAM and
temporary disk space as they assemble the sorted list. The htdig
program stores a fair amount of information about the URLs it visits, in
part to only index a page once. This takes a fair amount of RAM.
With cheap RAM, it never hurts to throw more memory at indexing larger
sites. In a pinch, swap will work, but it obviously really slows
things down.

all of the above to me implies that the htmerge sort uses the linux
/bin/sort program which in turn uses memory until it runs out, and then,
swap space. therefore a solution is to install more memory in the machine
and/or reinstall the operating system and reallocate more swap space. (moan
... i am 2,500 miles distant from the machine, so no chance of that.)

another solution might be to modify /bin/sort, or configure it, or replace
it with something which can be told what file system to use for temporary
space. i have enough disk space available on other file systems but not in
/ or in the swap space, apparently. that would seem to be the easiest.
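
for reference, GNU sort can be pointed at another directory for its temporary files, either with the -T option or through the TMPDIR environment variable. a minimal sketch follows; the scratch directory is a stand-in (on this machine it would be some directory under /gig), and it is an assumption that htmerge starts sort as a child process that inherits TMPDIR:

```shell
# stand-in scratch directory so the sketch runs anywhere; on the real machine
# this would be a directory on the file system with free space
# (hypothetical example: /gig/tmp)
SORT_TMP=${SORT_TMP:-$(mktemp -d)}

# option 1: tell sort directly where to put its temporary files
sorted_with_flag=$(printf 'pear\napple\nfig\n' | sort -T "$SORT_TMP")

# option 2: set TMPDIR, which sort reads when -T is not given; exporting it
# before running the indexing scripts should reach any sort they launch
# (assumption: the child process inherits the environment)
sorted_with_env=$(printf 'pear\napple\nfig\n' | TMPDIR="$SORT_TMP" sort)

echo "$sorted_with_flag"
```

the second form needs no change to /bin/sort itself, which matters when the sort is started by another program rather than typed by hand.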

or then again i might reconfigure htdig and sort of tune down the changes
made. the details of the changes are below. any suggestions appreciated.

many thanks in advance for your help!


the exact changes included

move the database to a disk with more space

< database_dir: /opt/www/htdig/db.ht
---
> database_dir:         /gig/opt/www/htdig/db.ht

add more url's to search
-------------------------

< start_url: http://www.hinduismtoday.kauai.hi.us http://www.hindu.org
---
> start_url: http://www.gurudeva.dynip.com/~htoday/today/Archives/ http://www.himalayanacademy.com http://www.hindu.org http://www.hinduism-today.com/ http://www.gurudeva.org

increase document size allowed (we have a lot that are 150k plus)
------------------------------

< max_head_length: 10000
---
> max_head_length: 200000

increase context returned and other minor adjustments
------------------------------------------------------

131a135,140
>
> # add new stuff experimental here
> user_agent: htdig
> allow_numbers: true
> excerpt_length: 500

