Subject: [htdig] htmerge: Deleted, no excerpt problem
From: Andre Dalle (adalle@freenet.carleton.ca)
Date: Thu May 18 2000 - 23:02:06 PDT

Chunks of our web site are failing to index due to being
dropped by htmerge.

I upgraded to ht://dig 3.1.5 but there is no change in behaviour
relating to this particular problem, although I do appreciate many
of the smart new features like local filesystem access!

I have checked the mailing list archives, and I am sure the usual
suspected causes are not at fault:

- robots.txt does not exclude the file (htdig should never have indexed
it in the first place if that were the case)
- server_max_docs is not in use and is definitely not at fault
- no 'noindex' or robot meta-tag in the html files
- there are keyword/description tags as well as plenty of text to search
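
The robots.txt and meta-tag checks in the list above can be repeated mechanically. The following is only a minimal sketch using the Python standard library; the function names, the sample robots.txt rules, and the sample HTML are hypothetical, and the user-agent string "htdig" is an assumption about what the crawler announces (it is configurable).

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def robots_txt_allows(robots_txt, url, agent="htdig"):
    """Return True if the given robots.txt text lets `agent` fetch `url`.

    The agent name "htdig" is an assumption, not taken from the post.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html):
    """Return True if the page carries a robots meta tag with 'noindex'."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives
```

Feeding these functions the real /robots.txt and the HTML of an affected page would confirm or rule out the first and third items in the list; they say nothing about server_max_docs or other htdig.conf limits.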

I am at a loss, though otherwise I am very pleased with ht://dig. Below is
a sample htdig/htmerge run on a small part of the website, and I dearly
hope that someone can shed some light on my problem!

Note also that I am using large header/document limits, so I don't think I'm
hitting any configured limit. I've been through the documentation
and can find no fault in my setup, which is basically the stock htdig.conf
with some of the default limits bumped up. I will attach the file just in case.

Feel free to GET http://www.ncf.ca/rapa/index.html.

I even removed /robots.txt for this run just to be sure.

Initial HTDIG run:

htdig# ./htdig -i -a -v -s
 New server: www.ncf.ca, 80
 0:0:0:http://www.ncf.ca/rapa: redirect
 1:1:0:http://www.ncf.ca/rapa/: ++++++** size = 5201
 2:2:1:http://www.ncf.ca/rapa/RAPAHistory.html: ****** size = 43660
 3:3:1:http://www.ncf.ca/rapa/PlayHist.html: ****** size = 8691
 4:4:1:http://www.ncf.ca/rapa/Sponsors.html: ****** size = 3112
 5:5:1:http://www.ncf.ca/rapa/SponsorInfo.html: *******- size = 3367
 6:6:1:http://www.ncf.ca/rapa/Board.html: ****** size = 2421
 7:7:1:http://www.ncf.ca/rapa/WhatsOn.html: ******- size = 2793
 htdig: Run complete
 htdig: 1 server seen:
 htdig: www.ncf.ca:80 8 documents

htdig# ./htmerge -vvv -s -a
htmerge: Sorting...
htmerge: Removing doc #0
htmerge: Merging...
htmerge: 100:association
htmerge: 200:box
htmerge: 300:churchs
htmerge: 400:critical
htmerge: 500:drama
htmerge: 600:faithfully
htmerge: 700:gathered
htmerge: 800:his
htmerge: 900:jaston
htmerge: 1000:lighted
htmerge: 1100:mears
htmerge: 1200:night
htmerge: 1300:peer
htmerge: 1400:public
htmerge: 1500:robinson
htmerge: 1600:shirts
htmerge: 1700:such
htmerge: 1800:totten
htmerge: 1900:waltons
htmerge: Total word count: 1995
Deleted, no excerpt: 0/http://www.ncf.ca/rapa
htmerge: Total documents: 7
htmerge: Total doc db size (in K): 67
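
To see which URLs htmerge dropped and what htdig reported for them, the two outputs above can be cross-referenced. This is a rough sketch only: the regular expressions are inferred from the line shapes in this one run (not from any documented log format), and the function names are hypothetical.

```python
import re

# Line shapes inferred from the run above; treat them as assumptions.
HTDIG_LINE = re.compile(r"^(\d+):(\d+):(\d+):(\S+): (.*)$")
DELETED_LINE = re.compile(r"^Deleted, no excerpt: (\d+)/(\S+)$")

# Excerpts copied from the run above.
HTDIG_LOG = """\
 0:0:0:http://www.ncf.ca/rapa: redirect
 1:1:0:http://www.ncf.ca/rapa/: ++++++** size = 5201
 2:2:1:http://www.ncf.ca/rapa/RAPAHistory.html: ****** size = 43660
"""
HTMERGE_LOG = """\
htmerge: Total word count: 1995
Deleted, no excerpt: 0/http://www.ncf.ca/rapa
htmerge: Total documents: 7
"""


def htdig_status_by_url(htdig_output):
    """Map each URL in the htdig output to the text that follows it."""
    statuses = {}
    for line in htdig_output.splitlines():
        match = HTDIG_LINE.match(line.strip())
        if match:
            statuses[match.group(4)] = match.group(5)
    return statuses


def explain_deletions(htdig_output, htmerge_output):
    """List (url, htdig status) for every 'Deleted, no excerpt' line."""
    statuses = htdig_status_by_url(htdig_output)
    report = []
    for line in htmerge_output.splitlines():
        match = DELETED_LINE.match(line.strip())
        if match:
            url = match.group(2)
            report.append((url, statuses.get(url, "not seen in htdig output")))
    return report


if __name__ == "__main__":
    for url, status in explain_deletions(HTDIG_LOG, HTMERGE_LOG):
        print(url, "->", status)
```

For the sample run above, the only URL this flags is http://www.ncf.ca/rapa, which htdig listed as a redirect. The script only shows that correlation; it does not explain why other parts of the site are being dropped.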

Andre Dalle			[adalle@ncf.ca]
Systems Administrator,
National Capital Freenet	[http://www.ncf.ca]

