Re: htdig: Pages get indexed several times


Warren Jones (wjones@tc.fluke.com)
Mon, 1 Dec 1997 09:08:32 -0800


Martin Berli writes:

> I noted a problem with htdig: When indexing a site, it doesn't see
> the symlinks (coming via http), so search results return the same
> pages more than once.

If you're getting web pages from a remote HTTP server, there's no
way for htdig to recognize a symlink: the server resolves the link
itself, and htdig sees only two different URLs. I think there has
been some discussion on this list of keeping track of checksums for
all web pages indexed, in order to avoid duplicates, but this is not
quite the same problem as avoiding symlinks, and as far as I know
no one has implemented the checksum idea.
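
For what it's worth, the checksum idea could look roughly like the
sketch below. None of this is htdig code: ChecksumIndex, fnv1a and
first_url are names made up for illustration, and FNV-1a is just one
convenient non-cryptographic hash, not something the list settled on.
The sketch also assumes that two pages with the same checksum are the
same page, which a real implementation might want to verify.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// FNV-1a, a simple 64-bit non-cryptographic hash (chosen for
// illustration only; any stable checksum would do).
std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 14695981039346656037ULL;   // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;                   // FNV prime
    }
    return h;
}

// Hypothetical helper: remembers the first URL seen for each
// page-body checksum.
class ChecksumIndex {
public:
    // Returns true if this body has not been seen before, false if
    // an earlier URL already delivered a body with the same checksum.
    // Assumption: equal checksums are treated as equal pages.
    bool add(const std::string& url, const std::string& body) {
        return first_url_.emplace(fnv1a(body), url).second;
    }

    // URL under which this body was first seen, or "" if never seen.
    std::string first_url(const std::string& body) const {
        auto it = first_url_.find(fnv1a(body));
        return it == first_url_.end() ? std::string() : it->second;
    }

private:
    std::unordered_map<std::uint64_t, std::string> first_url_;
};
```

A crawler would call add() once the page body has been fetched, and
skip indexing when it returns false. Unlike the inode check, this
costs a full download of the duplicate before it can be rejected.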

However, if you're getting web pages via a local file system,
recognizing symlinks is considerably easier. I'm enclosing a patch
against version 3.0.8b2 that avoids indexing a file twice through
symlinks (or hard links) by keeping track of the device and inode
of each page indexed. Note that this only works if you use the
"local_urls" feature of version 3.0.8b2.

--------------------------------------------------------------------
Warren Jones | To keep every cog and wheel is the first
Fluke Corporation | precaution of intelligent tinkering.
Everett, Washington, USA | -- Aldo Leopold
--------------------------------------------------------------------

Index: Retriever.cc
===================================================================
RCS file: /usr0/wjones/src/CVS.repo/htdig/htdig/Retriever.cc,v
retrieving revision 1.3
diff -c -r1.3 Retriever.cc
*** Retriever.cc 1997/09/04 21:18:40 1.3
--- Retriever.cc 1997/12/01 17:03:29
***************
*** 35,40 ****
--- 35,41 ----
  #include "Parsable.h"
  #include "Document.h"
  #include <StringList.h>
+ #include <sys/stat.h>
  
  static WordList words;
  
***************
*** 444,450 ****
      url = u;
      url.lowercase();
  
!     return !visited.Exists(url);
  }
  
  
--- 445,481 ----
      url = u;
      url.lowercase();
  
!     if ( visited.Exists(url) )
!         return FALSE;
!
!     String *local_filename = IsLocal(u);    // For local URL's, check
!     if ( local_filename )                   // list for device and inode
!     {                                       // to make sure we haven't
!         struct stat buf;                    // already indexed a link
!                                             // to this file.
!
!         if ( stat(local_filename->get(),&buf) == 0 )
!         {
!             char key[2*sizeof(ino_t)+2*sizeof(dev_t)+2];      // Make hash key
!             sprintf( key, "%x+%x", buf.st_dev, buf.st_ino );  // from device
!             if ( visited.Exists(key) )                        // and inode.
!             {
!                 if ( debug ) {
!                     String *dup = (String*)visited.Find(key);
!                     cout << endl
!                          << "Duplicate: " << local_filename->get()
!                          << " -> " << dup->get() << endl;
!                 }
!                 delete local_filename;
!                 return FALSE;
!             }
!             visited.Add(key,local_filename);
!             return TRUE;
!         }
!         delete local_filename;
!     }
!     return TRUE;
!
  }
  
  


