Re: [htdig] Pb indexing HTML with htdig 3.1.5


Subject: Re: [htdig] Pb indexing HTML with htdig 3.1.5
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Dec 07 2000 - 10:18:24 PST


According to André LAGADEC:
> I use htdig 3.1.5 on Red Hat Linux 5.0, and I want to index a new web
> site. But when I run rundig, only one document gets indexed.
>
> To see what it is doing, I ran rundig -vvvvvvv and got this output:
> Header line: HTTP/1.1 200 OK
> Header line: Server: Netscape-Enterprise/3.5.1C
> Header line: Date: Wed, 06 Dec 2000 07:32:02 GMT
> Header line: Content-type: text/html
> Header line: Last-modified: Mon, 15 Nov 1999 10:45:01 GMT
> Translated Mon, 15 Nov 1999 10:45:01 GMT to 1999-11-15 10:45:01 (99)
> And converted to Mon, 15 Nov 1999 10:45:01
> Header line: Content-length: 1258
> Header line: Accept-ranges: bytes
> Header line: Connection: close
> Header line:
> returnStatus = 0
> Read 1258 from document
> Read a total of 1258 bytes
> Tag: html>, matched -1
> head:
> size = 1258
> pick: x.y.z.t, # servers = 1
> htdig: Run complete
> htdig: 1 server seen:
> htdig: x.y.z.t:8000 1 document

You should be getting much more output than that with a verbosity level of
7! Is it possible that there is a NUL byte in the document, soon after the
"<html>" tag? For some reason, htdig seems to be stopping right after this
tag, and not getting anywhere close to the other tags in the document. I've
tried it myself on the document you sent, and on that copy it worked fine.
The comment around the JavaScript code is correct, and htdig was able to
handle it. There must be something different in your copy of the document,
such as a NUL byte, which is causing htdig's parser to end prematurely.
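
One way to test the NUL-byte theory is to scan a saved copy of the page, exactly
as the server delivers it, for bytes with value 0. The following is only a
minimal sketch in Python, not part of htdig: the function names
find_nul_offsets and report_nuls are made up for illustration, and it assumes
you have already saved the raw response body to a local file.

```python
def find_nul_offsets(data, limit=10):
    """Return the byte offsets of up to `limit` NUL bytes found in `data`."""
    offsets = []
    start = 0
    while len(offsets) < limit:
        pos = data.find(b"\x00", start)
        if pos == -1:
            break  # no further NUL bytes
        offsets.append(pos)
        start = pos + 1
    return offsets


def report_nuls(path):
    """Read `path` in binary mode and print where any NUL bytes occur."""
    # Binary mode matters: text mode could decode or translate bytes.
    with open(path, "rb") as fh:
        data = fh.read()
    hits = find_nul_offsets(data)
    if hits:
        print("%s: NUL byte(s) at offsets %s (of %d bytes)" % (path, hits, len(data)))
    else:
        print("%s: no NUL bytes in %d bytes" % (path, len(data)))
    return hits
```

If report_nuls() on your saved copy shows an offset shortly after the "<html>"
tag, that would fit the truncated output above. It is also worth checking that
the file size matches the 1258 bytes htdig reported reading, so you know you are
looking at the same content.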

> I think that htdig doesn't like the HTML code "<!--//" and "//-->":
> it sees the beginning of the comment but not the end, and so ignores the
> rest of the HTML code of the page.
>
> Am I right? Any other ideas? What can I do?
>
> N.B.: The HTML code of the first page of the site is below this line.
> _________________________________________________________________
> <html>
>
> <head>
> <title>Accueil DIRECTION</title>
> <base target="rtop">
> <script language="JavaScript">
> <!--//
> var url="";
> var nom="";
> var bName="";
>
> function Ouvrir()
> {
> bName = navigator.appName
> Version = navigator.appVersion
> Version = Version.substring(0,1)
> browserOK = ((Version >= 2))
>
> if (browserOK)
> {
> this.name="home";
>
> msgWindow=window.open("actu/default2.htm","popupdpd","location=no,toolbar=no,status=no,directories=no,scrollbars=yes,width=400,height=450");
> bName=navigator.appName;
> if (bName=="Netscape") msgWindow.focus();
>
> }
> }
> Ouvrir()
>
> //-->
> </script>
> </head>
>
> <frameset framespacing="0" border="false" frameborder="0" cols="155,*">
> <frame name="gauche" scrolling="no" noresize target="haut_droite"
> src="defaulta.htm"
> marginwidth="0" marginheight="5">
> <frameset rows="*,45">
> <frame name="texte" target="bas_droite" src="defaultb.htm"
> scrolling="auto"
> marginwidth="0" marginheight="0" noresize>
> <frame name="bas" src="basac.htm" scrolling="no" marginwidth="7"
> marginheight="15"
> noresize>
> </frameset>
> <noframes>
> <body>
> <p>Cette page utilise des cadres, mais votre navigateur ne les prend
> pas en charge.</p>
> </body>
> </noframes>
> </frameset>
> </html>

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



