Subject: RE: [htdig] htdig
From: Richard Bethany
Date: Thu Jan 11 2001 - 09:45:16 PST


That was my fear as well. For the one link below with eight menu items, I
need to accept p=1: through p=8: to pick up any/all links in the submenus,
but I would have to reject the other 40,312 possible combinations of values
that "p" can have. As you stated, that would be a mite cumbersome and, if
we had pages with more menu items (we do), it would become factorially
more impossible (<-- can something be "more" impossible? How about more
improbable?) to limit the accepted values.
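
For what it's worth, the arithmetic checks out; here's a quick sanity
check in Python (nothing htdig-specific, just the figures quoted above):

    import math

    menu_items = 8
    total = math.factorial(menu_items)  # 8*7*6*5*4*3*2*1 = 40,320 orderings
    accepted = menu_items               # p=1: through p=8: are the keepers
    print(total, total - accepted)      # -> 40320 40312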

Does the 3.2 beta release seem pretty stable? Does the regex functionality
work properly? If so, perhaps I'll give that a shot. If not, I suppose
I'll just dig around in the code to see if I can find a way to get it to do
what we need.

Thanks for your input, Gilles!! Thanks to you too, Geoff!!
Richard Bethany
S1 Corporation

-----Original Message-----
From: Gilles Detillieux
Sent: Thursday, January 11, 2001 12:13 PM
Cc: Richard Bethany;
Subject: Re: [htdig] htdig

According to Geoff Hutchison:
> No regular expressions needed. You can limit URLs based on query patterns
> already. See the bad_querystr attribute:
> <>
> On Thu, 11 Jan 2001, Richard Bethany wrote:
> > I'm the SysAdmin for our web servers and I'm working with Chuck (who does
> > the development work) on this problem. Here's the "nuts & bolts" of the
> > problem. Our entire web server is set up with a menuing system being run
> > through PHP3. This menuing system basically allows local documents/links
> > to be reached via a URL off of the PHP3 file. In other words, if I try to
> > access a particular page it will be accessed as
> >
> >
> > In this scenario the only relevant piece of info is the "i" value; the
> > remainder of the info simply describes which portions of the menu should
> > be displayed. What ends up happening is that, for a page with eight (8)
> > menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits" show up in htDig
> > for each link!! I essentially need to exclude any URL where "p" has more
> > than one value (i.e. - &p=1: is okay, &p=1:2: is not).
> >
> > I've looked through the mailing list archives and found a great deal of
> > discussion on the topic of regular expressions with exclusions, and also
> > talk of stripping parts of the URL, but I've seen nothing to indicate
> > that any of this has actually been implemented. Do you know if there is
> > any implementation of this? If not, I saw a reply to a different problem
> > from Gilles indicating that the URL::normalizePath() function would be
> > the place to start hacking, so I guess I'll try that.

I guess the problem, though, is that without regular expressions it
could mean a large list of possible values that need to be specified
explicitly. The same problem exists for exclude_urls as for bad_querystr,
as they're handled essentially the same way, the only difference being
that bad_querystr is limited to patterns occurring on or after the last
"?" in the URL.

So, if &p=1: is valid, but &p=[2-9].* and &p=1:[2-9].* are not, then
the explicit list in bad_querystr would need to be:

bad_querystr: &p=2 &p=3 &p=4 &p=5 &p=6 &p=7 &p=8 &p=9 \
                &p=1:2 &p=1:3 &p=1:4 &p=1:5 &p=1:6 &p=1:7 &p=1:8 &p=1:9

It gets a bit more complicated if you need to deal with numbers of two
or more digits too, because then you can allow &p=1: but not &p=1[0-9]:,
so you'd need to include these patterns in the list too:

        &p=10 &p=11 &p=12 &p=13 &p=14 &p=15 &p=16 &p=17 &p=18 &p=19 &p=1:1
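
If the list keeps growing, it may be less error-prone to generate it than
to type it; a throwaway Python sketch that prints the exact list above as
a config line you could paste into htdig.conf:

    # Reject &p=2 .. &p=9, &p=1:2 .. &p=1:9, the two-digit cases
    # &p=10 .. &p=19, and &p=1:1 (which also covers &p=1:10 .. &p=1:19).
    patterns = ["&p=%d" % n for n in range(2, 10)]
    patterns += ["&p=1:%d" % n for n in range(2, 10)]
    patterns += ["&p=1%d" % n for n in range(10)]
    patterns += ["&p=1:1"]
    print("bad_querystr: " + " \\\n\t".join(patterns))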

So, while it's not pretty, it is feasible provided the range of
possibilities doesn't get overly complex. This will be easier in 3.2,
which will allow regular expressions.
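
For example, Richard's actual rule (reject any URL where "p" carries more
than one value) should collapse to a single expression. I haven't checked
the exact 3.2 attribute syntax, but the pattern itself would look
something like this, shown here with Python's re module:

    import re

    # A digit after the first "value:" means p has a second value.
    multi_p = re.compile(r"[?&]p=\d+:\d")

    print(multi_p.search("/menu.php3?i=42&p=1:"))    # None  -> keep
    print(multi_p.search("/menu.php3?i=42&p=12:"))   # None  -> keep
    print(multi_p.search("/menu.php3?i=42&p=1:2:"))  # match -> exclude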

I think my suggestion for hacking URL::normalizePath() involved much more
complicated patterns, and search-and-replace style substitutions based
on those patterns. That may still be the way to go if you want to do
normalisations of patterns rather than simple exclusions, e.g. if you're
not guaranteed to hit a link to each page using a non-excluded pattern.
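
To sketch what I mean (in Python for brevity; URL::normalizePath() itself
is C++, and the function below is hypothetical, not anything in htdig):

    import re

    def keep_first_p_value(url):
        # Collapse "p" to its first value so that every permutation of
        # the menu state maps to one canonical URL, e.g.
        #   /menu.php3?i=42&p=1:3:2:  ->  /menu.php3?i=42&p=1:
        return re.sub(r"([?&]p=\d+:)[\d:]+", r"\1", url)

    print(keep_first_p_value("/menu.php3?i=42&p=1:3:2:"))  # ...&p=1:
    print(keep_first_p_value("/menu.php3?i=42&p=1:"))      # unchanged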

Gilles R. Detillieux              E-mail: <>
Spinal Cord Research Centre       WWW:
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
