Subject: [htdig] htdig 3.1.5 directory parent prune feature: solution ?
From: Peter L. Peres (firstname.lastname@example.org)
Date: Tue May 02 2000 - 14:35:05 PDT
What's the best place to implement this feature? The candidates in the
source, under htdig (indentation denotes call chart order):
ExternalParser::parse // at "case 'u'", "case 'm'"
HTML::parse // at "if(dofollow)", several places.
The catch is that canonicalization is done in got_href, but implementing
the feature requires the parent document URL in canonical form.
The easy way is to add an argument to got_href that carries the canonical
parent URL, and to implement the check inside got_href.
However, the canonical base URL needs to be pre-parsed for easy use of the
substring matching algorithm (or does it?), so maybe the canonicalization
code proper should be modified to do the parsing there, once, and expose
the parsed result as public data of some class.
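To make the pre-parsing idea concrete, here is a rough sketch of what "parse once, reuse later" could look like. split_path is a hypothetical helper, not an existing htdig function, and it assumes the path component has already been extracted from the canonical URL:

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of htdig: split a canonical URL path
// such as "/a/b/c" into its '/'-separated parts {"a", "b", "c"}.
// Empty parts (from leading, trailing, or doubled slashes) are dropped.
std::vector<std::string> split_path(const std::string &path)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start <= path.size()) {
        std::string::size_type end = path.find('/', start);
        if (end == std::string::npos)
            end = path.size();          // last part runs to end of string
        if (end > start)
            parts.push_back(path.substr(start, end - start));
        start = end + 1;                // continue after the separator
    }
    return parts;
}
```

The result could be computed once per parent document and kept alongside it, so that each href found in that document is compared against the stored parts instead of re-parsing the parent URL every time.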
Does HTML::do_tag also know nothing about the parent name?
I think the special pre-parsing should be done in HTML::parse and the
data stored in a public data member of HTML::, then used in got_href()
once the new URL has been canonicalized there, to call a new member
function of Retriever:: that will accept or prune the URL according to
the feature to be implemented.
The special case of the 'first' URL on a server must also be handled,
although it should never appear (as it is injected directly via push()
rather than through got_href()?).
Opinions on how best to do this?
PS: the feature, redescribed:
If a document with URL /a/b/c contains an href that is an exact leading
substring of /a/b/c (a parent directory such as /a/b or /a), then that href
should be ignored and left out of the URLs to be parsed (push()-ed).
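As a sketch only, the rule above could be written as a small predicate. is_parent_path and strip_trailing_slash are hypothetical names, not htdig functions. The sketch assumes both arguments are already-canonical paths, that a match must end on a '/' boundary (so /a/b is not treated as a parent of /a/bc/d), and that the root "/" counts as a parent of everything below it:

```cpp
#include <string>

// Hypothetical helper: drop one trailing '/' unless the path is just "/".
static std::string strip_trailing_slash(std::string s)
{
    if (s.size() > 1 && s[s.size() - 1] == '/')
        s.erase(s.size() - 1);
    return s;
}

// Hypothetical predicate, not part of htdig: true if 'href' names a
// parent directory of 'doc', e.g. href "/a/b" or "/a" for doc "/a/b/c".
bool is_parent_path(const std::string &doc_in, const std::string &href_in)
{
    const std::string doc = strip_trailing_slash(doc_in);
    const std::string href = strip_trailing_slash(href_in);

    if (href.empty() || href.size() >= doc.size())
        return false;                   // a parent is strictly shorter
    if (doc.compare(0, href.size(), href) != 0)
        return false;                   // not a leading substring
    // Assumption: the match must stop at a path separator.
    return href == "/" || doc[href.size()] == '/';
}
```

A caller would skip the push() when is_parent_path(parent_document_path, new_href_path) returns true. Whether the root should really be pruned is a policy question the sketch does not settle.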
* a good name for a config option that turns this feature on
* should the pruned URL appear in the URL list despite its not being
* what is a good strategy for matching a string (list) of tokens separated
by '/' backwards? This?
    match last char || fail
    while more parts
        match last part || fail
        last = prev(last)
* other ?
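For the backwards-matching question, one possible reading of the pseudocode is sketched here. parts_match_backwards is a hypothetical name, and the sketch assumes both URLs have already been split into their '/'-separated parts. It compares whole parts only and leaves out the single-character pre-check:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of htdig: true if every part of 'href'
// equals the part of 'doc' at the same position, checked from the
// deepest part of 'href' back towards the root.
bool parts_match_backwards(const std::vector<std::string> &doc,
                           const std::vector<std::string> &href)
{
    if (href.size() >= doc.size())
        return false;                   // a parent has fewer parts
    for (std::size_t i = href.size(); i > 0; --i) {
        if (href[i - 1] != doc[i - 1])
            return false;               // mismatch: stop immediately
    }
    return true;                        // all parts of href matched
}
```

Going backwards may pay off because URLs on the same server tend to share their leading parts, so a mismatch, when there is one, is more likely to show up in the deeper parts. Note that an empty href part list (the root) matches any non-empty document path here.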