Re: [htdig] Problem in creating database...


Subject: Re: [htdig] Problem in creating database...
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Aug 23 2000 - 13:48:16 PDT


According to Srini Sathya:
> Well, i tried my luck with hostname instead of ipaddress, okay here is the
> detailed error.txt.
>
> Thanks a lot for ur patience,
> Srini
...

OK, now we have something we can work with. See my annotations below...

> 1:0:http://192.168.0.208/shipped/GIB
> New server: 192.168.0.208, 80
> Retrieval command for http://192.168.0.208/robots.txt: GET /robots.txt
> HTTP/1.0
> User-Agent: htdig/3.1.5 (unconfigured@htdig.searchengine.maintainer)
> Host: 192.168.0.208
>
> Header line: HTTP/1.1 404 Not Found
> Header line: Date: Wed, 23 Aug 2000 19:24:04 GMT
> Header line: Server: Apache/1.3.12 (Unix) ApacheJServ/1.1.2
> PHP/4.0.1pl2
> Header line: Connection: close
> Header line: Content-Type: text/html; charset=iso-8859-1
> Header line:
> returnStatus = 1
> pushed

OK, before htdig fetches any files from a server, it looks for a robots.txt
file, according to the standard for robots exclusion. In your case the
server answers 404 Not Found, so there isn't one, which is OK. It then
continues with the start_url...
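
If it helps to see the robots.txt rule in miniature, here is a rough Python
sketch using the standard library's urllib.robotparser. It only illustrates
the robots exclusion check and is not htdig's own code; the function name
may_fetch and the sample Disallow rule are made up for the example.

```python
from urllib.robotparser import RobotFileParser

def may_fetch(status, robots_lines, agent, url):
    """Decide whether `agent` may fetch `url`, given the robots.txt response."""
    parser = RobotFileParser()
    if status == 404:
        # No robots.txt on the server: nothing is disallowed.
        parser.parse([])
    else:
        parser.parse(robots_lines)
    return parser.can_fetch(agent, url)

start_url = "http://192.168.0.208/shipped/GIB/"

# The case in the log above: robots.txt came back 404 Not Found.
print(may_fetch(404, [], "htdig", start_url))

# A made-up robots.txt that would have blocked the dig entirely.
blocking = ["User-agent: *", "Disallow: /shipped/"]
print(may_fetch(200, blocking, "htdig", start_url))
```

The first call allows the fetch and the second refuses it, so a missing
robots.txt is the harmless case.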

> pick: 192.168.0.208, # servers = 1
> 0:0:0:http://192.168.0.208/shipped/GIB: Retrieval command for
> http://192.168.0.208/shipped/GIB: GET /shipped/GIB HTTP/1.0
> User-Agent: htdig/3.1.5 (unconfigured@htdig.searchengine.maintainer)
> Host: 192.168.0.208
>
> Header line: HTTP/1.1 301 Moved Permanently
> Header line: Date: Wed, 23 Aug 2000 19:24:04 GMT
> Header line: Server: Apache/1.3.12 (Unix) ApacheJServ/1.1.2
> PHP/4.0.1pl2
> Header line: Location: http://192.168.0.208/shipped/GIB/
> Header line: Connection: close
> Header line: Content-Type: text/html; charset=iso-8859-1
> Header line:
> returnStatus = 3
> redirect
> redirect: http://192.168.0.208/shipped/GIB/
> resolving 'http://192.168.0.208/shipped/GIB/'
> pushing http://192.168.0.208/shipped/GIB/

Your start_url is a directory URL, but it's missing the trailing slash to
identify it as a directory URL, so the server issues a redirect to correct
this. Standard procedure, and htdig handles it fine.
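
The trailing slash matters because relative links are resolved against the
URL of the page they appear in. A small Python sketch with urllib.parse
shows the difference (the variable names are mine, and this is just an
illustration of standard URL resolution, not htdig code):

```python
from urllib.parse import urljoin

without_slash = "http://192.168.0.208/shipped/GIB"
with_slash = "http://192.168.0.208/shipped/GIB/"

# Without the slash, "GIB" is treated as a file name and gets replaced.
print(urljoin(without_slash, "intro.htm"))
# -> http://192.168.0.208/shipped/intro.htm

# With the slash, "GIB/" is a directory and the link lands inside it.
print(urljoin(with_slash, "intro.htm"))
# -> http://192.168.0.208/shipped/GIB/intro.htm
```

After following the redirect, the link to intro.htm resolves to the second
form, which is what shows up later in the log.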

> pick: 192.168.0.208, # servers = 1
> 1:1:0:http://192.168.0.208/shipped/GIB/: Retrieval command for
> http://192.168.0.208/shipped/GIB/: GET /shipped/GIB/ HTTP/1.0
> User-Agent: htdig/3.1.5 (unconfigured@htdig.searchengine.maintainer)
> Referer: http://192.168.0.208/shipped/GIB
> Host: 192.168.0.208
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Wed, 23 Aug 2000 19:24:04 GMT
> Header line: Server: Apache/1.3.12 (Unix) ApacheJServ/1.1.2
> PHP/4.0.1pl2
> Header line: Last-Modified: Thu, 17 Aug 2000 15:56:19 GMT
> Translated Thu, 17 Aug 2000 15:56:19 GMT to 2000-08-17 15:56:19
> (100)
> And converted to Thu, 17 Aug 2000 15:56:19
> Header line: ETag: "3e1ed-889-399c0b23"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 2185
> Header line: Connection: close
> Header line: Content-Type: text/html
> Header line:
> returnStatus = 0
> Read 2185 from document
> Read a total of 2185 bytes

OK, it read the index file for the start_url, and now it parses it...

> Tag: HTML>, matched -1
> Tag: HEAD>, matched -1
> Tag: META NAME="GENERATOR" Content="Microsoft Visual Studio 6.0">,
> matched 20
> Tag: TITLE>, matched 0
> Tag: /TITLE>, matched 1
>
> title:
> Tag: /HEAD>, matched -1
> Tag: BODY bgcolor="#FFFFFF">, matched -1
> Tag: table width="100%" border="0" cellspacing="0"
> cellpadding="0" height="100%">, matched -1
> Tag: tr valign="middle" align="center">, matched -1
> Tag: td>, matched -1
> Tag: table border="0" cellspacing="0" cellpadding="0">, matched
> -1
> Tag: tr>, matched -1
> Tag: td>, matched -1
> Tag: OBJECT classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
>
> codebase="http://active.macromedia.com/flash2/cabs/swflash.cab#version
> =2,0,0,0"
> ID=intro animation 4 WIDTH=550 HEIGHT=275>, matched 25
> Tag: PARAM NAME=movie VALUE="flash/intro_animation_4.swf">, matched
> -1
> Tag: PARAM NAME=quality VALUE=autohigh>, matched -1
> Tag: PARAM NAME=bgcolor VALUE=#FFFFFF>, matched -1
> Tag: SCRIPT LANGUAGE=JavaScript>, matched -1
> Tag: /SCRIPT>, matched -1

I did mention before that htdig does not parse JavaScript. If you have
any JavaScript links to other documents here, htdig will not see them.
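
To make that concrete, here is a rough Python sketch of a link extractor
that, like any parser that only looks at HTML tags, never sees a target
that exists only inside script code. The LinkFinder class, the sample page
and hidden.htm are all hypothetical; this is not how htdig is implemented.

```python
from html.parser import HTMLParser

class LinkFinder(HTMLParser):
    """Collect link targets from <a href> and <frame src> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "frame" and attrs.get("src"):
            self.links.append(attrs["src"])

# A made-up page: one ordinary link, one target reachable only via script.
page = """
<a href="intro.htm">skip intro</a>
<script language="JavaScript">
  function go() { window.location = "hidden.htm"; }
</script>
"""

finder = LinkFinder()
finder.feed(page)
print(finder.links)
```

Only intro.htm is collected. hidden.htm is just text inside the script
block, so a tag-based parser has no way to know it is a link.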

> Tag: NOEMBED>, matched -1
> Tag: IMG SRC="flash/intro_animation_4.gif" WIDTH=550 HEIGHT=275
> BORDER=0>, matched 18
> image: http://192.168.0.208/shipped/GIB/flash/intro_animation_4.gif
> Tag: /NOEMBED>, matched -1
> Tag: NOSCRIPT>, matched -1
> Tag: IMG SRC="flash/intro_animation_4.gif" WIDTH=550 HEIGHT=275
> BORDER=0>, matched 18
> image: http://192.168.0.208/shipped/GIB/flash/intro_animation_4.gif
> Tag: /NOSCRIPT>, matched -1
> Tag: /OBJECT>, matched -1
> Tag: /td>, matched -1
> Tag: /tr>, matched -1
> Tag: tr>, matched -1
> Tag: td height=5>, matched -1
> Tag: /td>, matched -1
> Tag: /tr>, matched -1
> Tag: tr align="right">, matched -1
> Tag: td>, matched -1
> Tag: a href="intro.htm">, matched 2
> A tag: pos = 2, position = ="intro.htm">
> Tag: img src="images/icons/skip_intro.gif" width="77" height="14"
> border="0" alt="skip intro" vspace="0" hspace="0">, matched
> 18
> word: skip@840
> word: intro@844
> image: http://192.168.0.208/shipped/GIB/images/icons/skip_intro.gif
> Tag: /a>, matched 3
> href: http://192.168.0.208/shipped/GIB/intro.htm (skip intro )
> resolving 'http://192.168.0.208/shipped/GIB/intro.htm'
>
> pushing http://192.168.0.208/shipped/GIB/intro.htm

htdig just encountered the first HTML link in the shipped/GIB/ index file,
which it accepted and pushed for later retrieval.

> +Tag: /td>, matched -1
> Tag: /tr>, matched -1
> Tag: /table>, matched -1
> Tag: /td>, matched -1
> Tag: /tr>, matched -1
> Tag: /table>, matched -1
> Tag: /BODY>, matched -1
> Tag: /HTML>, matched -1
> size = 2185

That's the end of the first file. It only found one link in it, which it
now fetches...

> pick: 192.168.0.208, # servers = 1
> 2:2:1:http://192.168.0.208/shipped/GIB/intro.htm: Retrieval command for
> http://192.168.0.208/shipped/GIB/intro.htm: GET /shipped/GIB/intro.htm
> HTTP/1.0
> User-Agent: htdig/3.1.5 (unconfigured@htdig.searchengine.maintainer)
> Referer: http://192.168.0.208/shipped/GIB/
> Host: 192.168.0.208
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Wed, 23 Aug 2000 19:24:04 GMT
> Header line: Server: Apache/1.3.12 (Unix) ApacheJServ/1.1.2
> PHP/4.0.1pl2
> Header line: Last-Modified: Thu, 17 Aug 2000 14:40:25 GMT
> Translated Thu, 17 Aug 2000 14:40:25 GMT to 2000-08-17 14:40:25
> (100)
> And converted to Thu, 17 Aug 2000 14:40:25
> Header line: ETag: "3e1ee-23c-399bf959"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 572
> Header line: Connection: close
> Header line: Content-Type: text/html
> Header line:
> returnStatus = 0
> Read 572 from document
> Read a total of 572 bytes

It got the intro.htm file, and now it parses it...

> Tag: html>, matched -1
> Tag: head>, matched -1
> Tag: title>, matched 0
> word: Deutsche@36
> word: Bank@52
> Tag: /title>, matched 1
>
> title: Deutsche Bank
> Tag: /head>, matched -1
> Tag: frameset rows="*,1" marginwidth="0" marginheight="0"
> framespacing="0" frameborder="0" border="no" noresize>, matched
> -1
> Tag: frame src="tframe.htm" marginwidth="0" marginheight="0"
> framespacing="0" frameborder="0" border="no" noresize
> scrolling="no">, matched 21
> href: http://192.168.0.208/shipped/GIB/tframe.htm ()
>
> Rejected: Item in the exclude list: item # 10 length: 11
>
> url rejected: (level 1)http://192.168.0.208/shipped/GIB/tframe.htm

htdig just encountered the first HTML link in the intro.htm file, in
the form of a frame tag, which it rejects because tframe.htm is in your
exclude_urls attribute, as you wanted.
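
As I understand the documentation, a URL is rejected if it contains any of
the patterns listed in exclude_urls. A minimal Python sketch of that test
follows; the pattern list is an assumption for illustration, not a copy of
your config, and first_excluded is a made-up helper name.

```python
def first_excluded(url, patterns):
    """Return the first pattern contained in url, or None if none match."""
    for pattern in patterns:
        if pattern in url:
            return pattern
    return None

# Assumed pattern list, for illustration only.
exclude_urls = ["/cgi-bin/", ".cgi", "tframe.htm"]

base = "http://192.168.0.208/shipped/GIB/"
print(first_excluded(base + "tframe.htm", exclude_urls))
print(first_excluded(base + "bframe.htm", exclude_urls))
```

The first URL matches the tframe.htm pattern and would be rejected; the
second matches nothing and would be pushed for retrieval.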

> Tag: frame src="bframe.htm" marginwidth="0" marginheight="0"
> framespacing="0" frameborder="0" border="no" noresize
> scrolling="no" name="tframe">, matched 21
> href: http://192.168.0.208/shipped/GIB/bframe.htm ()
> resolving 'http://192.168.0.208/shipped/GIB/bframe.htm'
>
> pushing http://192.168.0.208/shipped/GIB/bframe.htm

htdig just encountered the second HTML link in intro.htm, also in a frame
tag, which it accepted and pushed for later retrieval.

> +Tag: /frameset>, matched -1
> Tag: noframes>, matched -1
> Tag: body bgcolor="#FFFFFF">, matched -1
> Tag: p>, matched -1
> word: Your@839
> word: browser@848
> word: not@867
> word: able@874
> word: not@888
> word: configured@895
> word: view@919
> word: frames@928
> Tag: /p>, matched -1
> Tag: /body>, matched -1
> Tag: /noframes>, matched -1
> Tag: /html>, matched -1
> size = 572

That's the end of the second file. It only found one valid, non-excluded
link in it, which it now fetches...

> pick: 192.168.0.208, # servers = 1
> 3:3:2:http://192.168.0.208/shipped/GIB/bframe.htm: Retrieval command
> for http://192.168.0.208/shipped/GIB/bframe.htm: GET
> /shipped/GIB/bframe.htm HTTP/1.0
> User-Agent: htdig/3.1.5 (unconfigured@htdig.searchengine.maintainer)
> Referer: http://192.168.0.208/shipped/GIB/intro.htm
> Host: 192.168.0.208
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Wed, 23 Aug 2000 19:24:04 GMT
> Header line: Server: Apache/1.3.12 (Unix) ApacheJServ/1.1.2
> PHP/4.0.1pl2
> Header line: Last-Modified: Thu, 17 Aug 2000 14:38:30 GMT
> Translated Thu, 17 Aug 2000 14:38:30 GMT to 2000-08-17 14:38:30
> (100)
> And converted to Thu, 17 Aug 2000 14:38:30
> Header line: ETag: "3e102-5d-399bf8e6"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 93
> Header line: Connection: close
> Header line: Content-Type: text/html
> Header line:
> returnStatus = 0
> Read 93 from document
> Read a total of 93 bytes

It got the bframe.htm file, and now it parses it...

> Tag: html>, matched -1
> Tag: head>, matched -1
> Tag: title>, matched 0
> word: Deutsche@223
> word: Bank@319
> Tag: /title>, matched 1
>
> title: Deutsche Bank
> Tag: /head>, matched -1
> Tag: body bgcolor="#FFFFFF">, matched -1
> Tag: /body>, matched -1
> Tag: /html>, matched -1
> size = 93
> pick: 192.168.0.208, # servers = 1

That's the end of the file. There's just not a whole lot to it: 93 bytes
and no links at all, so htdig has nothing left to fetch. The problem is
not in your configuration file, unless you want htdig to index tframe.htm
after all. The problem is in your HTML files, which just don't seem to
contain many links to other files. Remember that htdig is a spider - it
only follows HTML links.
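
The whole run can be replayed as a toy crawl in Python. The link table is
taken from the log above; the dig function and the exclude list are my own
simplification for illustration, not htdig's actual algorithm or your
actual config.

```python
from collections import deque

# Links found in each page, as reported in the log.
site = {
    "GIB/": ["intro.htm"],
    "intro.htm": ["tframe.htm", "bframe.htm"],
    "bframe.htm": [],
}
exclude_urls = ["tframe.htm"]  # assumed, for illustration

def dig(start, site, excluded):
    """Fetch pages breadth-first, following only non-excluded links."""
    queue = deque([start])
    seen = {start}
    fetched = []
    while queue:
        page = queue.popleft()
        fetched.append(page)
        for link in site.get(page, []):
            if link in seen or any(p in link for p in excluded):
                continue
            seen.add(link)
            queue.append(link)
    return fetched

print(dig("GIB/", site, exclude_urls))
# -> ['GIB/', 'intro.htm', 'bframe.htm']
```

Three pages are fetched and then the queue is empty. Dropping tframe.htm
from the exclude list would let the toy crawl reach it too, but anything
beyond that depends on what links tframe.htm itself contains.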

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
