Re: Double url references (fwd)


Steve Scott (sscott@uucom.com)
Mon, 20 Apr 1998 11:21:25 -0400 (EDT)


> To: andrew@contigo.com (Andrew Scherpbier)

Andrew,
        I have not heard back from you, but I have posted another message to the
mailing list. Could you tell me if I am using allow_virtual_hosts correctly?
You said to set it to false. Does this mean 0, false, or FALSE? My most recent try
is with a value of 0, and this does not seem to work.
Thanks,
Steve.
Hi all,
        I continue to have problems digging with 3.0.8b2. I have tried to
apply the patch that looks at local filesystems and only retrieves URLs
that have unique inodes. I have also taken a suggestion to put the option
allow_virtual_hosts 0 in the htdig.conf file. However, I still retrieve
the exact same pages under different URLs because some of the hosts that I
am digging have two different domain names. My question for the group: has
anyone had success using the allow_virtual_hosts option in htdig.conf so that
multiple URLs pointing to the same page do not appear?
Thanks,
Steve Scott

>
> Andrew,
> I added this variable to the htdig.conf file with this line:
> allow_virtual_hosts: false
> This does not seem to change anything, even after redigging my database.
> Do I need to recompile any executables? What about the patch Retriever.cc.0?
> Will this fix the problem also by denying two URLs that have the same inode? I
> tried to apply the patch yesterday before I received your response. The problem
> with the patch is that it doesn't compile. It seems to fail around the following code:
> --- 442,478 ----
> url = u;
> url.lowercase();
>
> ! if ( visited.Exists(url) )
> ! return FALSE;
> !
> ! String *local_filename = IsLocal(u); // For local URL's, check
> ! if ( local_filename ) // list for device and inode
> ! { // to make sure we haven't
> ! struct stat buf; // already indexed a link
> ! // to this file.
> !
> The compiler complains about the IsLocal(u) call. Where does this come from?
> Should it be IsValidUrl(u) instead? Also, the patch says to put this at line 442
> of the code, but that does not seem to be the correct line, and the following
> combination of code appears in two places:
> url = u;
> url.lowercase();
>
> In the latest version, line 417 starts with:
> Retriever::Need2Get(char *u)
> {
> static String url;
> url = u;
> url.lowercase();
>
> return !visited.Exists(url);
> }
>
> and line 456 starts with:
> static String url;
> url = u;
> url.lowercase();
>
> I tried the code in both places yesterday, and both failed to compile on IsLocal(u).
> I am not a C programmer, but I have experience in Perl, so I am trying to
> use logic to apply the patch as best I can. Any more suggestions?
> I thank you for your quick response yesterday,
> Steve Scott
> UUcom, Inc
> sscott@uucom.com
>
>
>
>
> >
> > Steve Scott [UUcom] wrote:
> > >
> > > Andrew,
> > > I have installed version 3.0.8b2 of htdig on a SunOS
> > > 4.1.3_U1 machine. The previous version we had installed on this
> > > machine was htdig version 3.0.4. On our machine the server is
> > > known by three different names: www.xxx.com, aaa.com, and name.xxx.com,
> > > where aaa and xxx stand in for the actual names. Running htdig with the new
> > > version and then searching returns URLs under two of the different
> > > names, referencing the exact same page. Example of the search results:
> > > www.xxx.com/dir/page1.html
> > > aaa.com/dir/page1.html
> > >
> > > I have tried to limit the URLs returned by setting the htdig.conf
> > > parameter exclude_urls to remove aaa.com. This seems to work, in that it
> > > removes the duplicate entries. The only problem is that I have over 1000
> > > web pages that use aaa.com and are now excluded from the htdig process.
> > >
> > > The older version 3.0.4 does not seem to have the problem of bringing
> > > back URLs that reference the same page. Do you have any ideas on why
> > > this appears to be happening on 3.0.8b2 and not 3.0.4?
> > > Any suggestions would be greatly appreciated.
> > > Steve Scott
> > > sscott@uucom.com
> Andrew Scherpbier wrote:
> >
> > Set 'allow_virtual_hosts' to false in the htdig.conf file. That will change
> > it back to the way it behaved before 3.0.8b2.
> > Unfortunately, that attribute didn't make it into the docs, I noticed. Doh!
> >
> > --
> > Andrew Scherpbier <andrew@contigo.com>
> > Contigo Software <http://www.contigo.com/>
> >
>
>
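
The duplicate results described in this thread come from one server answering
to several host names. One general way to collapse them, shown here only as an
illustration and not as what htdig or allow_virtual_hosts does internally, is
to map every alias to a single canonical host before a URL is compared against
the list of visited pages. The AliasTable type and CanonicalUrl function are
hypothetical names, the sketch handles only http:// URLs, and the host names
are the placeholders used in the messages above.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical alias table: each name the server answers to maps to the
// one name that should appear in the index.
typedef std::map<std::string, std::string> AliasTable;

// Rewrites the host part of an http:// URL to its canonical name.
// URLs with any other scheme are returned unchanged.
std::string CanonicalUrl(const std::string &url, const AliasTable &aliases)
{
    const std::string scheme = "http://";
    if (url.compare(0, scheme.size(), scheme) != 0)
        return url;

    // The host runs from the end of the scheme to the next '/' (or to
    // the end of the string when there is no path).
    std::string::size_type host_end = url.find('/', scheme.size());
    if (host_end == std::string::npos)
        host_end = url.size();

    std::string host = url.substr(scheme.size(), host_end - scheme.size());

    // Host names are case-insensitive, so compare them in lowercase.
    std::transform(host.begin(), host.end(), host.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    AliasTable::const_iterator it = aliases.find(host);
    if (it != aliases.end())
        host = it->second;

    return scheme + host + url.substr(host_end);
}
```

With aaa.com and name.xxx.com both mapped to www.xxx.com, the two example
URLs for /dir/page1.html reduce to the same string, so a visited-list lookup
would treat the second one as already seen.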



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:02 PST