[htdig] htdig 3.1.5 directory parent prune feature: solution ?


Subject: [htdig] htdig 3.1.5 directory parent prune feature: solution ?
From: Peter L. Peres (plp@actcom.co.il)
Date: Tue May 02 2000 - 14:35:05 PDT


Hi,

What is the best place to implement this feature?

The relevant source, under htdig (indentation denotes call order):

ExternalParser::parse // at "case 'u'", "case 'm'"
HTML::parse // at "if(dofollow)", several places.
  Retriever::got_href()
    Server::push()

The catch is that canonicalization is done in got_href(), but implementing
the feature requires the parent document's URL in canonical form as well.

The easy way is to add an argument to got_href() that carries the
canonical parent URL, and to implement the check inside got_href().

However, the canonical base URL needs to be pre-parsed so that the
substring-matching algorithm is easy to apply (or does it?), so it may be
better to modify the canonicalization code itself to do the parsing there,
once, and expose the parsed result as public data of some class.

HTML::do_tag also seems to know nothing about the parent document's name?

I think the special pre-parsing should be done in HTML::parse, with the
result stored in a public data member of HTML. got_href() would then,
after canonicalizing the new URL, call a new member function of Retriever
that either accepts or prunes the URL according to the feature to be
implemented.
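
To make the "parse the parent once, test each href later" idea concrete,
here is a minimal standalone sketch. It does not use the real htdig
classes: split_path() and ParentPruner are hypothetical names invented for
illustration, and plain std::string paths stand in for canonical URLs. It
also assumes that "/" and a trailing '/' should be treated like any other
parent directory, which the proposal above does not settle.

```cpp
#include <string>
#include <vector>

// Split a canonical URL path such as "/a/b/c" into {"a", "b", "c"}.
// Empty segments (leading, trailing, or doubled '/') are dropped.
std::vector<std::string> split_path(const std::string &path)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start < path.size()) {
        std::string::size_type end = path.find('/', start);
        if (end == std::string::npos)
            end = path.size();
        if (end > start)
            parts.push_back(path.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}

// Hypothetical helper (not part of htdig): holds the parent document's
// path, split once, so that every href found in it can be checked cheaply.
class ParentPruner {
public:
    explicit ParentPruner(const std::string &parent_path)
        : parent_(split_path(parent_path)) {}

    // True if href_path names a strict parent directory of the parent
    // document, i.e. the link would be pruned instead of push()-ed.
    bool should_prune(const std::string &href_path) const
    {
        const std::vector<std::string> href = split_path(href_path);
        if (href.size() >= parent_.size())
            return false;               // same depth or deeper: keep it
        for (std::vector<std::string>::size_type i = 0; i < href.size(); ++i)
            if (href[i] != parent_[i])
                return false;           // diverges from the parent's path
        return true;
    }

private:
    std::vector<std::string> parent_;
};
```

In the design described above, something like ParentPruner would be built
in HTML::parse and consulted from got_href() once the new URL has been
canonicalized. Comparing whole segments rather than raw substrings means
that /a/bc is not mistaken for a parent of /a/bcd/e.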

The special case of the first URL on a server must also be handled,
although it should never reach this code (it is injected directly via
push() and not through got_href(), isn't it?).

Opinions on how best to do this?

tia,

        Peter

PS: the feature, described again:

If a document with URL /a/b/c contains an href that is an exact substring
of /a/b/c (that is, one of its parent directories), such as /a/b or /a,
then that href should be ignored and removed from the URLs to be parsed
(push()-ed).

Questions:

* What is a good name for a config option that turns this feature on?
* Should the pruned URL still appear in the URL list even though it is not
followed?
* What is a good strategy for matching a list of tokens separated by '/'
backwards? Something like this?

match last char || fail
while more parts
  match last part || fail
  last = prev(last)

* other ?
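
One possible reading of the backwards match sketched above, written out as
standalone C++. The function name is_ancestor_backwards() is made up for
this example, and it assumes both arguments are already canonical absolute
paths (starting with '/', with no host part). It first checks that the
href ends on a '/' boundary in the parent, then compares one segment at a
time from the last to the first.

```cpp
#include <string>

// Hypothetical helper, not htdig code: true if href is a strict parent
// directory of parent, e.g. parent "/a/b/c" and href "/a/b" or "/a".
bool is_ancestor_backwards(const std::string &parent, const std::string &href)
{
    if (href.empty())
        return false;

    std::string::size_type end = href.size();
    // Ignore one trailing '/' so that "/a/b/" is treated like "/a/b".
    if (href[end - 1] == '/')
        --end;

    // The href must be strictly shorter than the parent and must stop on
    // a '/' boundary in it; otherwise "/a/b" would match "/a/bc".
    if (end >= parent.size() || parent[end] != '/')
        return false;

    // Walk the segments from last to first, comparing "/segment" pieces.
    while (end > 0) {
        const std::string::size_type slash = href.rfind('/', end - 1);
        if (slash == std::string::npos)
            return false;               // href is not an absolute path
        if (parent.compare(slash, end - slash, href, slash, end - slash) != 0)
            return false;               // this segment differs
        end = slash;
    }
    return true;
}
```

Links within one document tend to share their leading directories, so a
mismatch is more likely to show up near the end of the path, which is the
argument for scanning backwards. Whether that is measurably faster than a
single forward comparison plus the boundary check is not something this
sketch settles.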
