Re: [htdig3-dev] Very Basic Question


Torsten Neuer (tneuer@inwise.de)
Mon, 17 May 1999 15:05:29 +0200


According to Stankard, Michael L.:
>Hello -
>We are trying to implement ht://Dig for our corporate Intranet; the
>person who was doing it has left, leaving me to deal with the
>implementation. The problem is the tool works too well; although he
>assured me the tool used "spidering" of active links to find documents
>to search, it seems to be finding everything that is on the server,
>leading to some interesting responses.
>
>Is it true that ht://Dig should only find documents with active links to
>the page you start spidering?
>If that is true, what else could cause it to find all these additional
>documents (older versions of pages, etc.) ?
>thnx MLS

ht://Dig will only reveal those pages in search results which are
reachable from the start URL. However, such a document might also
be an automatically generated directory index which is returned by
a server if no index document is available for a specific directory.
So if any of your documents refers to just a directory, the digger
will get the server's directory index and follow those links, too.
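
To illustrate the effect, here is a minimal sketch in Python (not
ht://Dig code). The site layout in LINKS and the function name
reachable() are made up for the example. It walks a toy link graph
from a start URL and shows that a single link to a bare directory
pulls in every file the generated listing mentions.

```python
from collections import deque

# Hypothetical toy site: each URL maps to the links found on that page.
# "/docs/" has no index document, so the server answers with a generated
# listing that links to every file in the directory.
LINKS = {
    "/index.html": ["/about.html", "/docs/"],
    "/about.html": [],
    "/docs/": ["/docs/current.html", "/docs/old-draft.html"],
    "/docs/current.html": [],
    "/docs/old-draft.html": [],
}


def reachable(start, links):
    """Return the set of URLs a link-following digger would visit."""
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    for url in sorted(reachable("/index.html", LINKS)):
        print(url)
```

In this toy layout no hand-written page links to old-draft.html; it is
found only because index.html links to "/docs/" and the listing for
that directory links to everything in it.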

If your intranet is large, it might be quite difficult to find out
what exactly is causing this. If you have a server-side programming
language (hmm.. server-side includes might work as well) installed,
you could write a dummy index document which just returns some
information about itself and about the referring document along
with a unique searchable term (anything that might come into your
mind and which is unlikely to be in the database). Then re-index
the server and look for this search term. The search result should
now point you to the erroneous documents. Correct the documents if
necessary (i.e. if the link is wrong) or put the respective directory
in the list of URLs that ht://Dig should omit (the exclude_urls
attribute in the configuration file), re-index and search again
until no more of the dummy pages are found.
After this cleanup you should remove the dummy files again.
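
A dummy index document along these lines could be sketched as a small
CGI script in Python. This is only an illustration under several
assumptions: that your server runs CGI scripts and can be told to use
one as the index document, that the digger sends a Referer header,
and that the marker word below (made up here) occurs nowhere else on
your site. The names MARKER and render_probe_page are hypothetical.

```python
import html
import os

# Hypothetical unique term; pick anything that cannot occur in real pages.
MARKER = "zzqxdirindexprobe"


def render_probe_page(script_url, referer):
    """Build a tiny HTML page naming itself and the page that linked here."""
    safe_self = html.escape(script_url)
    safe_ref = html.escape(referer or "(no referer sent)")
    return (
        "<html><head><title>" + MARKER + " " + safe_self + "</title></head>\n"
        "<body>\n"
        "<p>" + MARKER + "</p>\n"
        "<p>This page: " + safe_self + "</p>\n"
        "<p>Referring document: " + safe_ref + "</p>\n"
        "</body></html>"
    )


def main():
    # Standard CGI environment variables; they are only set when the
    # script is run by a web server.
    script_url = os.environ.get("SCRIPT_NAME", "(unknown)")
    referer = os.environ.get("HTTP_REFERER", "")
    print("Content-Type: text/html")
    print()
    print(render_probe_page(script_url, referer))


if __name__ == "__main__":
    main()
```

After re-indexing, a search for the marker word should list one hit
per directory that was entered through a bare directory link, and the
"Referring document" line in each hit names the page to correct.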

You could also try to turn off the generation of server-generated
directory index documents, but that might cause trouble where
people are relying on that feature.

hth,
  Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: info@inwise.de            Internet: http://www.inwise.de
