Re: [htdig] Fw: [htdig] - Question for start_url and exclude_urls


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Jan 05 2001 - 10:28:13 PST


According to "Mohai Wang" <mwang@coreon.net>:
> > 1. start_url:
> > as long as start_url = "http://stagsite.coreon.com/download/". When I run
> > "rundig -vvv >log", I got error message from screen "DB2 problem...: missing
> > or empty key value specified". I also attached debug mode "log" and
> > "htdig.conf" files, please take a look. Did I set wrong option?
> > If start_url = "http://stagsite.coreon.com/" that it will go through to
> > write index, because I only need to write everything under "download"
> > nothing else.

The "missing or empty key value specified" error happens when the one
and only entry in the db.docdb database is deleted because the document
could not be fetched. I.e. this is a symptom, and not the root cause of
the problem. The root cause is very clearly indicated in your attached
"log.dat" file:

> 0:0:0:http://stagesite.coreon.com/download/: Retrieval command for http://stagesite.coreon.com/download/: GET /download/ HTTP/1.0
> User-Agent: htdig/3.1.5 (unconfigured@htdig.searchengine.maintainer)
> Host: stagesite.coreon.com
>
> Header line: HTTP/1.1 403 Forbidden

The 403 Forbidden error means htdig could not fetch the only document
specified in your start_url, i.e. the /download/ directory. 403 errors
are almost always the result of file permission problems: the web
server's user ID does not have read permission (or search/execute
permission) on that directory, so no web client can access it from your
web server. With Apache, a directory that has no index file and for
which directory listings are disabled also returns 403. You'd almost
certainly get the same error from your web browser if you requested
this same URL.
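
A minimal sketch of how to spot such failures without reading the whole
log by eye. It assumes the verbose log uses the "Retrieval command for
..." and "Header line: HTTP/..." lines shown in the excerpt above; the
function name failed_fetches is hypothetical and is not part of htdig.

```python
import re

# Patterns taken from the log excerpt above (assumed stable for this htdig version).
RETRIEVAL = re.compile(r"Retrieval command for (\S+): GET ")
STATUS = re.compile(r"Header line: HTTP/\d\.\d (\d{3})\s*(.*)")


def failed_fetches(log_text):
    """Return (url, status, reason) for every fetch whose status is not 2xx."""
    failures = []
    current_url = None
    for line in log_text.splitlines():
        match = RETRIEVAL.search(line)
        if match:
            # Remember which URL the next status line belongs to.
            current_url = match.group(1)
            continue
        match = STATUS.search(line)
        if match and current_url is not None:
            status = int(match.group(1))
            if not 200 <= status < 300:
                failures.append((current_url, status, match.group(2).strip()))
            current_url = None
    return failures


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_log.py log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for url, status, reason in failed_fetches(handle.read()):
            print(status, reason, url)
```

Any URL this reports is worth requesting from a browser as well, to
confirm that the server refuses it for every client and not only for
htdig.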

> > 2. exclude_urls:
> > I try to do something differently, start_url =
> > "http://stagsite.coreon.com/" then I added exclude_urls = "/cgi-bin/
> > /calendar/ /coreonlib/". When I run "rundig -vvv >log3", it will read
> > /coreonlib/ first then stop. After I took off "coreonlib" from exclude_urls
> > then rerun "rundig -vvv >log2" that everything are indexing and reject
> > "cgi-bin" and "calendar". Could you tell me why? Please take a look log3
> > file.
...
> 0:0:0:http://stagesite.coreon.com/: Retrieval command for http://stagesite.coreon.com/: GET / HTTP/1.0
> User-Agent: htdig/3.1.5 (unconfigured@htdig.searchengine.maintainer)
> Host: stagesite.coreon.com
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Thu, 04 Jan 2001 16:27:48 GMT
> Header line: Server: Apache/1.3.12 (Unix) tomcat/1.0 mod_perl/1.24 mod_ssl/2.6.6 OpenSSL/0.9.4
> Header line: Last-Modified: Tue, 12 Dec 2000 02:14:53 GMT
> Translated Tue, 12 Dec 2000 02:14:53 GMT to 2000-12-12 02:14:53 (100)
> And converted to Tue, 12 Dec 2000 02:14:53
> Header line: ETag: "48890-dc0-3a358a1d"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 3520
> Header line: Connection: close
> Header line: Content-Type: text/html
> Header line:
> returnStatus = 0
> Read 3520 from document
> Read a total of 3520 bytes
>
> title: Insite
> href: http://stagesite.coreon.com/coreonlib/html/top_index.htm ()
>
> Rejected: Item in the exclude list: item # 1 length: 11
>
> url rejected: (level 1)http://stagesite.coreon.com/coreonlib/html/top_index.htm
> href: http://stagesite.coreon.com/coreonlib/html/main.html ()
>
> Rejected: Item in the exclude list: item # 1 length: 11
>
> url rejected: (level 1)http://stagesite.coreon.com/coreonlib/html/main.html
> size = 3520
> pick: stagesite.coreon.com, # servers = 1
> htmerge: Sorting...
> htmerge: Merging...
>
> 0/http://stagesite.coreon.com/

This log3.dat file doesn't look complete to me. At the third level of
verbosity, which you need in order to get detailed rejection messages
like those above, I think you should be getting much more detail than
that. Is this just an excerpt of the full log?

From what I can see above, htdig is picking up only two links from your
main index page, and both are rejected. That is what you asked for,
because log3 is the result of running htdig with /coreonlib/ in
exclude_urls. The question is why htdig does not pick up and use any
other links, and I can't answer that without the complete log. Does the
complete log show more links than that, and if so, what are the reasons
for rejection?

If htdig doesn't see any links other than those two, you need to find
out why. Are you expecting it to see JavaScript links? It won't! See the
FAQ (http://www.htdig.org/FAQ.html), especially questions 5.25 and 5.27.
Perhaps htdig doesn't see any links to the rest of your site on the main
index page, but does find them somewhere in coreonlib when you allow it
to look there. In that case, you'd need to add something on your main
index page that htdig can follow to get to the rest of the site.
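
To answer the "how many links did htdig see, and why were they
rejected" question from a complete log, something like the following
sketch may help. It assumes the "href:", "Rejected:" and "url rejected:"
lines look like those in the excerpt above; summarize_links is a
hypothetical helper, not an htdig utility.

```python
import re

# Line shapes copied from the -vvv excerpt above (an assumption about the format).
HREF = re.compile(r"^\s*href: (\S+)")
REASON = re.compile(r"^\s*Rejected: (.*)")
REJECTED_URL = re.compile(r"^\s*url rejected: \(level \d+\)(\S+)")


def summarize_links(log_text):
    """Return (seen, rejected, followed) for the links found in a verbose log.

    seen     -- every URL reported on an "href:" line, in order
    rejected -- dict mapping each rejected URL to the reason logged just before it
    followed -- seen URLs that were never reported as rejected
    """
    seen = []
    rejected = {}
    last_reason = None
    for line in log_text.splitlines():
        match = HREF.match(line)
        if match:
            seen.append(match.group(1))
            continue
        match = REASON.match(line)
        if match:
            # The reason line precedes the "url rejected:" line it explains.
            last_reason = match.group(1).strip()
            continue
        match = REJECTED_URL.match(line)
        if match:
            rejected[match.group(1)] = last_reason
            last_reason = None
    followed = [url for url in seen if url not in rejected]
    return seen, rejected, followed
```

If followed comes back empty, as it would for the excerpt above, htdig
has nothing left to crawl after the start page, which matches the
behaviour described in the question.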

Also, please try to examine your logs more thoroughly, as errors like
the 403 error above shouldn't be dismissed so easily as inconsequential.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



