Re: Double url references


Steve Scott (sscott@uucom.com)
Wed, 8 Apr 1998 10:12:00 -0400 (EDT)


Andrew,
        I added this variable to the htdig.conf file with this line:
allow_virtual_hosts: false
This does not seem to change anything, even after re-digging my database.
Do I need to recompile any executables? What about the patch Retriever.cc.0?
Will that also fix the problem, by rejecting two URLs that share the same
inode? I tried to apply the patch yesterday, before I received your response.
The problem with the patch is that it doesn't compile. It seems to error out
around the following code:
--- 442,478 ----
      url = u;
      url.lowercase();

!     if (visited.Exists(url))
!         return FALSE;
!
!     String *local_filename = IsLocal(u); // For local URLs, check
!     if (local_filename)                  // list for device and inode
!     {                                    // to make sure we haven't
!         struct stat buf;                 // already indexed a link
!                                          // to this file.
!
The compiler complains about IsLocal(u). Where does this function come from?
Should it be IsValidUrl(u) instead? Also, the patch says to apply this at line
442 of the code; however, that does not seem to be the correct line, and there
are two locations with the combination of code
    url = u;
    url.lowercase();

In the latest version, line 417 starts with:
Retriever::Need2Get(char *u)
{
    static String url;
    url = u;
    url.lowercase();

    return !visited.Exists(url);
}

and line 456 starts with:
    static String url;
    url = u;
    url.lowercase();

I tried the code in both places yesterday, and both failed to compile on
IsLocal(u). I am not a C programmer, but I have experience in Perl, so I am
trying to use logic to apply the patch as best I can. Any more suggestions?
Thank you for your quick response yesterday,
Steve Scott
UUcom, Inc
sscott@uucom.com
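
[For context, the logic the truncated patch appears to be adding, checking a
file's st_dev/st_ino pair so that two URLs resolving to the same local file
are indexed only once, can be sketched in self-contained form. This is an
illustration only, not htdig's actual code: std::set and std::string stand in
for htdig's Dictionary and String classes, and need_to_get / local_path are
made-up names.]

```cpp
#include <sys/stat.h>
#include <set>
#include <string>
#include <utility>

// Stand-ins for htdig's visited-URL list and for the device/inode
// pairs of local files that have already been indexed.
static std::set<std::string> visited;
static std::set<std::pair<dev_t, ino_t>> seen_inodes;

// Returns true if the URL still needs to be fetched.  local_path is
// empty for non-local URLs (htdig's IsLocal() would supply it).
bool need_to_get(const std::string& url, const std::string& local_path)
{
    if (visited.count(url))
        return false;                    // exact URL already seen

    if (!local_path.empty()) {
        struct stat buf;
        if (stat(local_path.c_str(), &buf) == 0) {
            // Two URLs that point at the same file share st_dev and
            // st_ino, so a second alias of the same page is skipped.
            if (!seen_inodes.insert({buf.st_dev, buf.st_ino}).second)
                return false;
        }
    }
    visited.insert(url);
    return true;
}
```

[With this check, www.xxx.com/dir/page1.html and aaa.com/dir/page1.html would
map to the same device/inode pair, and only the first would be indexed.]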

>
> Steve Scott [UUcom] wrote:
> >
> > Andrew,
> > I have installed version 3.0.8b2 of htdig on a SunOS
> > 4.1.3_U1 machine. The previous version we had installed on this
> > machine was htdig version 3.0.4. On our machine the server is
> > known by three different URLs: www.xxx.com, aaa.com, and name.xxx.com,
> > where aaa and xxx are actual names. Doing an htdig with the new version
> > and then a search results in URLs returning with two of the different
> > names referencing the exact same page. Example of the search:
> > www.xxx.com/dir/page1.html
> > aaa.com/dir/page1.html
> >
> > I have tried to limit the URLs returned by setting the htdig.conf
> > parameter exclude_urls to remove aaa.com. This seems to work, by
> > removing the duplicate entries. The only problem is that I have over 1000
> > web pages that use aaa.com and are now excluded from the htdig process.
> >
> > The older version, 3.0.4, does not seem to have the problem of bringing
> > back URLs that reference the same page. Do you have any ideas on why
> > this appears to be happening in 3.0.8b2 and not 3.0.4?
> > Any suggestions would be greatly appreciated.
> > Steve Scott
> > sscott@uucom.com
Andrew Scherpbier wrote:
>
> Set 'allow_virtual_hosts' to false in the htdig.conf file. That will change
> it back to the way it behaved before 3.0.8b2.
> Unfortunately, that attribute didn't make it into the docs, I noticed. Doh!
>
> --
> Andrew Scherpbier <andrew@contigo.com>
> Contigo Software <http://www.contigo.com/>
>



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:01 PST