Re: [htdig] htmerge: Deleted, no excerpt problem


Subject: Re: [htdig] htmerge: Deleted, no excerpt problem
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri May 19 2000 - 14:42:47 PDT


According to Andre Dalle:
> Chunks of our web site are failing to index due to being
> dropped by htmerge.
...
> I have checked the mailing list archives, and am sure the usual
> suggested problems are not at fault..
>
> - robots.txt does not exclude the file (htdig should have never indexed
> it in the first place if that was the case?)

That's true, but htdig adds an entry to db.docdb when it first sees
a link to a file, without checking at that point whether it's disallowed.
Only later, when it actually pulls that URL off the queue to index
it, does it determine that it's disallowed. It then flags the entry as
such, but leaves it in db.docdb for htmerge to delete. 3.2 does this in
a cleaner way, by checking for disallowed URLs when it first sees the
href, so it won't add any disallowed URL to the database to begin with.
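
To make the difference concrete, here is a minimal Python sketch of the
two orderings described above. It is not htdig's code: the function
names, the status strings and the prefix-based disallow test are all
assumptions made for illustration, and the "docdb" here is just a dict.

```python
from collections import deque


def is_disallowed(url, disallowed_prefixes):
    """Hypothetical stand-in for a robots.txt check: simple prefix match."""
    return any(url.startswith(prefix) for prefix in disallowed_prefixes)


def crawl_deferred_check(links, disallowed_prefixes):
    """Deferred ordering: record every link first, check it when dequeued."""
    docdb = {}
    queue = deque()
    for url in links:
        docdb[url] = "pending"  # entry exists as soon as the link is seen
        queue.append(url)
    while queue:
        url = queue.popleft()
        if is_disallowed(url, disallowed_prefixes):
            docdb[url] = "disallowed"  # flagged, left for the merge step
        else:
            docdb[url] = "indexed"
    return docdb


def crawl_early_check(links, disallowed_prefixes):
    """Early ordering: check at the href, so disallowed URLs are never stored."""
    docdb = {}
    for url in links:
        if is_disallowed(url, disallowed_prefixes):
            continue  # never enters the database
        docdb[url] = "indexed"
    return docdb


def merge(docdb):
    """Hypothetical merge step: drop every entry that was not indexed."""
    return {url: status for url, status in docdb.items() if status == "indexed"}
```

With the deferred ordering the merge step has an entry to delete (and to
report); with the early ordering the final database is the same, but
nothing is ever reported as removed.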

> - server_max_docs is not in use and is definitely not at fault
> - no 'noindex' or robot meta-tag in the html files
> - there are keyword/description tags as well as plenty of text to search
...
> Initial HTDIG run:
>
> htdig# ./htdig -i -a -v -s
>
> New server: www.ncf.ca, 80
> 0:0:0:http://www.ncf.ca/rapa: redirect
> 1:1:0:http://www.ncf.ca/rapa/: ++++++** size = 5201
...
> htdig: www.ncf.ca:80 8 documents
> htdig# ./htmerge -vvv -s -a
> htmerge: Sorting...
> htmerge: Removing doc #0
...
> Deleted, no excerpt: 0/http://www.ncf.ca/rapa
...
> htmerge: Total documents: 7
> htmerge: Total doc db size (in K): 67

The only document entry it deleted above was one that htdig created for
http://www.ncf.ca/rapa, which is not a complete URL. The server gave
htdig a redirect for that URL, to http://www.ncf.ca/rapa/, which did
remain in the database. This is quite normal, and no actual document
was deleted. Can you run a larger test case and find an example of a
document that is deleted which shouldn't be?
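
For a larger run, a small script can help separate harmless deletions
like this one from ones worth a closer look. The Python sketch below is
only a heuristic: the regular expressions are modelled on the few output
lines quoted above, so other verbosity levels or versions may print
different formats, and since the redirect target is not shown in the
log, it simply assumes the target is the same URL with a trailing slash.
All names in it are hypothetical.

```python
import re

# Patterns modelled on the htdig/htmerge lines quoted in this message.
REDIRECT_RE = re.compile(r"^\d+:\d+:\d+:(\S+): redirect$")
FETCHED_RE = re.compile(r"^\d+:\d+:\d+:(\S+): .*size = \d+")
DELETED_RE = re.compile(r"^Deleted, no excerpt: \d+/(\S+)$")


def classify_deletions(htdig_lines, htmerge_lines):
    """Split deleted URLs into (benign, suspicious) lists.

    A deletion counts as benign when htdig logged a redirect for the URL
    and also fetched the same URL with a trailing slash appended.
    """
    redirected, fetched = set(), set()
    for raw in htdig_lines:
        line = raw.strip()
        match = REDIRECT_RE.match(line)
        if match:
            redirected.add(match.group(1))
            continue
        match = FETCHED_RE.match(line)
        if match:
            fetched.add(match.group(1))

    benign, suspicious = [], []
    for raw in htmerge_lines:
        match = DELETED_RE.match(raw.strip())
        if not match:
            continue
        url = match.group(1)
        if url in redirected and url + "/" in fetched:
            benign.append(url)
        else:
            suspicious.append(url)
    return benign, suspicious
```

Feed it the saved output of the htdig and htmerge runs, one line per
list element; anything that ends up in the second list is a candidate
for a document that was deleted when it shouldn't have been.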

> #
> max_head_length: 100000
> max_meta_description_lenth: 1000
> max_description_lenth: 100

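The last two attribute names above are misspelled ("lenth" instead of
"length"); they should be max_meta_description_length and
max_description_length. Apart from that, I see no error
in either your htdig.conf or your htdig/htmerge output.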
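
Typos like these are easy to miss by eye, so a quick check of the
attribute names can help. The Python sketch below is an illustration
only: KNOWN_ATTRIBUTES lists just the three names discussed here and
would need to be extended with the full attribute list for real use,
the function name is hypothetical, and the parsing handles only simple
"name: value" lines and "#" comments, not continuation lines.

```python
import difflib

# Assumption: a tiny sample list; extend it with the attributes you use.
KNOWN_ATTRIBUTES = [
    "max_head_length",
    "max_meta_description_length",
    "max_description_length",
]


def find_suspect_attributes(config_text, known=KNOWN_ATTRIBUTES):
    """Return (name, suggestion) pairs for attribute names not in `known`.

    `suggestion` is the closest known name, or None if nothing is similar.
    """
    suspects = []
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and anything that is not "name: value".
        if not line or line.startswith("#") or ":" not in line:
            continue
        name = line.split(":", 1)[0].strip()
        if name in known:
            continue
        close = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
        suspects.append((name, close[0] if close else None))
    return suspects
```

Run against the three lines quoted above, it reports the two "lenth"
names and suggests the "length" spellings for each.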

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Fri May 19 2000 - 12:31:00 PDT