Re: [htdig] dissapearing urls?


Subject: Re: [htdig] dissapearing urls?
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Aug 24 2000 - 14:37:54 PDT


According to Sasa Mutic:
> I have a big start.url file with around 11000 URLs. I have split it
> into 30 smaller files and run "htdig -s" and "htmerge" after each one
> completed. Everything was fine until I had finished something like 4000
> URLs (the db.url file was showing over 10 million at the time, but that
> is with many duplicates). Then I noticed that I no longer get some hits
> that I was getting before. For example, my own homepage, which was
> indexed at the start, used to be among the hits; now I get 0 hits for
> the same word. I also noticed that the hit counts for some other words
> have dropped.
>
> It seems like htsearch doesn't search through the whole db file anymore.
> Or maybe the file is limited in size and it truncates the beginning
> after adding more at the end?
>
> filesizes:
> db.docdb: 829570K
> db.docs.index: 4447232
> db.urls: 924091K
> db.wordlist: 879319K
> db.words.db: 699739K
> -------------------------------------
> total: 3.333 GB

I was hoping someone with more knowledge of the DB internals would
respond, but no one bit. While those databases are large, it doesn't seem
to me that you've hit any sort of size limit yet. The symptoms do suggest
some sort of database corruption, but it's not clear why it happened.
About all I can suggest is that you start over, and perhaps try fewer
files with more URLs in each.
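
For what it's worth, here is a rough way to tally the sizes from the
listing above and compare each file against a per-file ceiling. This is
only a sketch under assumptions: "K" is taken to mean 1024 bytes,
db.docs.index is taken to be in plain bytes since it was listed without
a suffix, and the 2 GiB ceiling (ASSUMED_LIMIT) is a hypothetical
stand-in for whatever limit the OS or DB library might impose, not
something confirmed for this setup.

```python
# Sizes as reported in the quoted listing.
# Assumption: "K" means 1024 bytes; db.docs.index is in plain bytes.
K = 1024
sizes = {
    "db.docdb": 829570 * K,
    "db.docs.index": 4447232,
    "db.urls": 924091 * K,
    "db.wordlist": 879319 * K,
    "db.words.db": 699739 * K,
}

# Hypothetical per-file ceiling (largest signed 32-bit file offset).
# This is an assumption for illustration, not a known limit of the setup.
ASSUMED_LIMIT = 2**31 - 1

total = sum(sizes.values())
largest_name, largest_size = max(sizes.items(), key=lambda kv: kv[1])
over_limit = [name for name, size in sizes.items() if size > ASSUMED_LIMIT]

print(f"total: {total} bytes ({total / 2**30:.2f} GiB)")
print(f"largest: {largest_name} at {largest_size} bytes")
print(f"files over assumed limit: {over_limit or 'none'}")
```

Under those assumptions no single file comes near the ceiling, even
though the combined total is above it, which fits with corruption being
the more likely explanation than a size limit.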

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



