Re: [htdig] db sizes with 3.1.5


Subject: Re: [htdig] db sizes with 3.1.5
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Feb 28 2000 - 15:39:13 PST


According to Rusty Wright:
> I just upgraded to 3.1.5 from 3.1.2; thanks for maintaining this
> software.
>
> With 3.1.5 I noticed the files in the db directory are about 3 times
> bigger than with 3.1.2; is this normal? I couldn't find any mention
> of this in the release notes on the web site.
>
> Here are the listings of the two directories:
>
> htdig-3.1.2/db:
> total 66704
> 21888 -rw-r--r-- 1 root other 11191296 Feb 28 05:05 db.docdb
> 800 -rw-r--r-- 1 root other 401408 Feb 28 05:05 db.docs.index
> 22704 -rw-r--r-- 1 root other 11611928 Feb 28 05:05 db.wordlist
> 21312 -rw-r--r-- 1 root other 10903552 Feb 28 05:05 db.words.db
>
> htdig-3.1.5/db:
> total 163712
> 38048 -rw-r--r-- 1 root other 19453952 Feb 28 12:23 db.docdb
> 1104 -rw-r--r-- 1 root other 555008 Feb 28 12:23 db.docs.index
> 65408 -rw-r--r-- 1 root other 33462478 Feb 28 12:23 db.wordlist
> 59152 -rw-r--r-- 1 root other 30256128 Feb 28 12:23 db.words.db

The fact that the db.docdb almost doubled in size suggests that htdig
indexed more documents than before, and perhaps got somewhat more text
in the excerpts. If you run both versions with the -s option, you'll
get stats on this. I assume you ran both with the same configuration.
If not, there could be a reason, other than changes in the software,
for htdig indexing more text. The db.wordlist nearly tripled in size,
so it would seem that it got more words per document as well, but again
without clear stats on this it's hard to say for sure.
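
To put rough numbers on the growth, here is a small Python sketch that
computes the per-file ratios from the byte sizes in the two listings
quoted above. The dictionary and function names are my own, chosen for
illustration only; this is not part of ht://Dig.

```python
# Byte sizes copied from the two directory listings quoted above.
sizes_312 = {
    "db.docdb": 11191296,
    "db.docs.index": 401408,
    "db.wordlist": 11611928,
    "db.words.db": 10903552,
}
sizes_315 = {
    "db.docdb": 19453952,
    "db.docs.index": 555008,
    "db.wordlist": 33462478,
    "db.words.db": 30256128,
}


def growth_ratios(old, new):
    """Return new size / old size for every file present in `old`."""
    return {name: new[name] / old[name] for name in old}


ratios = growth_ratios(sizes_312, sizes_315)
for name, ratio in sorted(ratios.items()):
    print(f"{name:15s} {ratio:.2f}x")
```

The two word databases grow much more than db.docdb does, which is
consistent with more words per document and not just more documents.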

There were a number of changes in 3.1.3 through 3.1.5 that could have
caused htdig to index more documents and/or more words per document:

In 3.1.3:
- Fixed a bug where SGML entities inside HTML tags were not expanded.
  (This could cause more hrefs to be followed, if SGML entities were used
  in some hrefs.)
- Fixed a bug in URL parsing, where documents ending in the value used for
  remove_default_doc were ignored. For example, a URL ending in
  /left_index.html would become /.
  (This could allow htdig to find documents or entire subtrees of a site
  that were missed.)
- Fixed META robot parsing to correctly parse multiple directives.
  (This could allow htdig to index some documents it erroneously rejected
  before, or follow links it didn't before.)
- Added support for <EMBED>, <OBJECT>, and <LINK> HTML tags.
  (This could allow htdig to find documents it was missing before.)
- When indexing, htdig should now attempt to index the parts of a compound
  word as separate words, in addition to the compound word itself.
  (This would not affect the document count, but would increase the word
  count.)
In 3.1.4:
- HTML parser now indexes text in alt parameter of img tags...
  (This would not affect the document count, but would increase the word
  count and the excerpt size in db.docdb.)
In 3.1.5:
- Fixed a bug that could cause problems with 8-bit characters on some systems.
  (This could have an impact on word count.)
- Fixed htdig's handling of robots.txt, such that only the first applicable
  User-agent field bearing its name will be used, rather than only the last.
  (This could allow htdig to index some documents it erroneously rejected
  before, or could be a symptom of an erroneous robots.txt file.)
- Fixed handling of relative URLs with trailing ".." or leading "//".
  (This could allow htdig to find documents or entire subtrees of a site
  that were missed.)

Any of these could account for extra data in your databases, which
would be a symptom of fixed bugs, and not new bugs. If you can point
out anything concrete that would suggest inefficiencies or data that
should not be added to the database, please let us know. However,
without anything more than database sizes to go on, it's pretty hard to
determine if the size increase is unusual or not.
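
As a rough illustration of why the compound-word change listed under
3.1.3 raises the word count without touching the document count, here
is a toy Python sketch. It is not htdig's actual parser: the function
name, the regular expression, and the assumption that compounds are
split on hyphens are all mine, for illustration only.

```python
import re


def index_terms(text, split_compounds):
    """Toy tokenizer: return the list of terms that would be indexed.

    Assumption for illustration: a "compound word" is a run of
    alphanumeric parts joined by hyphens.
    """
    terms = []
    for token in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        terms.append(token.lower())
        if split_compounds and "-" in token:
            # Also index each part of the compound as its own word.
            terms.extend(part.lower() for part in token.split("-"))
    return terms


sample = "A full-text search engine with a web-based front-end"
before = index_terms(sample, split_compounds=False)
after = index_terms(sample, split_compounds=True)
print(len(before), len(after))
```

The same text yields more index entries once the parts are added, so
a site with many compound words would see its word databases grow even
if nothing else changed.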

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Mon Feb 28 2000 - 15:43:18 PST