Re: [htdig3-dev] Patch to avoid duplicate local URL's


Subject: Re: [htdig3-dev] Patch to avoid duplicate local URL's
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Feb 28 2000 - 08:48:35 PST


According to Joe R. Jah:
> I installed htdig3.1.5 yesterday. Everything works fine;) many thanks to
> all developers and contributors.
>
> I still have many symlinks in my site that cause an unacceptable number of
> duplicate URL's in htdig search results; therefore, I'd like to use the
> patch that was originally written by Warren:
>
> ftp://sol.ccsf.cc.ca.us/htdig-patches/3.0.8b2/Retriever.cc.0
>
> and later adopted by Gilles and modified by me:
>
> ftp://sol.ccsf.cc.ca.us/htdig-patches/3.1.4/Retriever.cc.0
>
> and later reposted by Warren:
>
> ftp://sol.ccsf.cc.ca.us/htdig-patches/3.1.4/Retriever.cc.1
>
> The latter two, (practically identical,) were made for 3.1.4, but they
> apply perfectly to 3.1.5; however, compilation stops with the following
> error:
> _______________________________________________________________________
> [snip]
> Retriever.cc: In method `int Retriever::Need2Get(char *)':
> Retriever.cc:611: type `String' is not a base type for type `StringList'
> gmake[1]: *** [Retriever.o] Error 1
> gmake[1]: Leaving directory `/home/jjah/tmp/htdig-3.1.5/htdig'
> gmake: *** [all] Error 1
> _______________________________________________________________________
>
>
> How can the patch be modified, (patched;), to work with 3.1.5? I'd like
> to place the patch in 3.1.5 folder of the unofficial patch site for others
> who are in my situation.

I was afraid of this. I realised that the changes to GetLocal(), which now
returns a list of possible file names, would almost certainly break this
patch, but I didn't have time last week to look into it. Please try the
patch below. It compiles, but I haven't actually tested it, as I don't have
hrefs to symlinks on my site.

For a couple of alternative methods of suppressing symbolic links in the
index, have a look at this e-mail message in the htdig@htdig.org archives:

        http://www.htdig.org/mail/2000/02/0085.html

*** htdig/Retriever.cc.orig Thu Feb 24 20:29:10 2000
--- htdig/Retriever.cc Mon Feb 28 10:33:05 2000
***************
*** 18,23 ****
--- 18,25 ----
  #include <signal.h>
  #include <assert.h>
  #include <stdio.h>
+ #include <sys/types.h>
+ #include <sys/stat.h>
  #include "HtWordType.h"
  
  static WordList words;
*************** Retriever::Need2Get(char *u)
*** 603,609 ****
      static String url;
      url = u;
  
!     return !visited.Exists(url);
  }
  
  
--- 605,655 ----
      static String url;
      url = u;
  
!     if (visited.Exists(url))
!         return FALSE;
!
!     StringList *local_filenames = GetLocal(u);
!     if (!local_filenames)
!         return TRUE;
!
!     //
!     // For local URL's, check list for device and inode to make
!     // sure we haven't already indexed a link to this file.
!     //
!     struct stat buf;
!     String *file;
!
!     local_filenames->Start_Get();
!     while ((file = (String *)local_filenames->Get_Next()) &&
!            ((stat(*file, &buf) == -1) || !S_ISREG(buf.st_mode)))
!         ;
!     if (!file)
!     {
!         delete local_filenames;
!         return TRUE;
!     }
!
!     //
!     // Make hash key from device and inode:
!     //
!     char key[4*sizeof(unsigned long)+2];
!     sprintf(key, "%lx+%lx", (unsigned long)buf.st_dev, (unsigned long)buf.st_ino);
!
!     if (!visited.Exists(key))
!     {
!         visited.Add(key, new String(*file));
!         delete local_filenames;
!         return TRUE;
!     }
!
!     if (debug)
!     {
!         String *dup = (String*)visited.Find(key);
!         cout << endl << "Duplicate: " << *file << " -> " << dup->get() << endl;
!     }
!
!     delete local_filenames;
!     return FALSE;
  }
  
  

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Mon Feb 28 2000 - 08:52:46 PST