Subject: RE: [htdig] htdig
From: Richard Bethany (richard.bethany@s1.com)
Date: Thu Jan 11 2001 - 09:45:16 PST


Gilles,

That was my fear as well. For the one link below with eight menu items, I
need to accept p=1: through p=8: to pick up any/all links in the submenus,
but I would have to reject the other 40,312 possible combinations of values
that "p" can have. As you stated, that would be a mite cumbersome and, if
we had pages with more menu items (we do), it would become exponentially
more impossible (<-- can something be "more" impossible? How about more
improbable?) to limit the accepted values.

Does the 3.2 beta release seem pretty stable? Does the regex functionality
work properly? If so, perhaps I'll give that a shot. If not, I suppose
I'll just dig around in the code to see if I can find a way to get it to do
what we need.

Thanks for your input, Gilles!! Thanks to you too, Geoff!!
Richard Bethany
S1 Corporation

-----Original Message-----
From: Gilles Detillieux [mailto:grdetil@scrc.umanitoba.ca]
Sent: Thursday, January 11, 2001 12:13 PM
To: ghutchis@wso.williams.edu
Cc: Richard Bethany; htdig@htdig.org
Subject: Re: [htdig] htdig

According to Geoff Hutchison:
> No regular expressions needed. You can limit URLs based on query patterns
> already. See the bad_querystr attribute:
> <http://www.htdig.org/attrs.html#bad_querystr>
...
> On Thu, 11 Jan 2001, Richard Bethany wrote:
> > I'm the SysAdmin for our web servers and I'm working with Chuck (who
> > does the development work) on this problem. Here's the "nuts & bolts"
> > of the problem. Our entire web server is set up with a menuing system
> > being run through PHP3. This menuing system basically allows local
> > documents/links to be reached via a URL off of the PHP3 file. In other
> > words, if I try to access a particular page it will be accessed as
> > http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:.
> >
> > In this scenario the only relevant piece of info is the "i" value; the
> > remainder of the info simply describes which portions of the menu
> > should be displayed. What ends up happening is that, for a page with
> > eight (8) main menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits"
> > show up in htDig for each link!! I essentially need to exclude any URL
> > where "p" has more than one value (i.e. - &p=1: is okay, &p=1:2: is
> > not).
> >
> > I've looked through the mailing list archives and found a great deal
> > of discussion on the topic of regular expressions with exclusions and
> > also some talk of stripping parts of the URL, but I've seen nothing to
> > indicate that any of this has actually been implemented. Do you know
> > if there is any implementation of this? If not, I saw a reply to a
> > different problem from Gilles indicating that the URL::normalizePath()
> > function would be the best place to start hacking so I guess I'll try
> > that.

I guess the problem, though, is that without regular expressions it
could mean a large list of possible values that need to be specified
explicitly. The same problem exists for exclude_urls as for bad_querystr,
as they're handled essentially the same way, the only difference being
that bad_querystr is limited to patterns occurring on or after the last
"?" in the URL.

So, if &p=1: is valid, but &p=[2-9].* and &p=1:[2-9].* are not, then
the explicit list in bad_querystr would need to be:

bad_querystr: &p=2 &p=3 &p=4 &p=5 &p=6 &p=7 &p=8 &p=9 \
                &p=1:2 &p=1:3 &p=1:4 &p=1:5 &p=1:6 &p=1:7 &p=1:8 &p=1:9

It gets a bit more complicated if you also need to deal with numbers of
two or more digits, because then you must still allow &p=1: while
rejecting &p=1[0-9]:, so you'd need to include these patterns in the
list as well:

        &p=10 &p=11 &p=12 &p=13 &p=14 &p=15 &p=16 &p=17 &p=18 &p=19 &p=1:1

So, while it's not pretty, it is feasible provided the range of
possibilities doesn't get overly complex. This will be easier in 3.2,
which will allow regular expressions.
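
To illustrate, here's a quick standalone sketch (ordinary C++ using
std::regex, not htdig source code, and not necessarily the exact syntax
3.2 will accept) of a single pattern that covers the whole explicit list
above, flagging any URL whose "p" parameter is anything other than the
single value "1:", including the two-digit cases like &p=12: :

    // Standalone illustration (compile with g++ -std=c++11);
    // this is NOT htdig code, just the matching logic.
    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // Negative lookahead: match "&p=" or "?p=" unless the value
        // is exactly "1:" followed by the end of the URL, '&' or '#'.
        std::regex bad_p("[&?]p=(?!1:($|[&#]))");

        const std::string urls[] = {
            "http://ourweb.com/DEPT/index.php3?i=1&e=3&p=1:",      // keep
            "http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:",  // reject
            "http://ourweb.com/DEPT/index.php3?i=1&e=3&p=1:2:",    // reject
            "http://ourweb.com/DEPT/index.php3?i=1&e=3&p=12:",     // reject
        };
        for (const auto& url : urls)
            std::cout << url << " -> "
                      << (std::regex_search(url, bad_p) ? "reject" : "keep")
                      << "\n";
    }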

I think my suggestion for hacking URL::normalizePath() involved much more
complicated patterns, and search-and-replace style substitutions based
on those patterns. That may still be the way to go if you want to do
normalisations of patterns rather than simple exclusions, e.g. if you're
not guaranteed to hit a link to each page using a non-excluded pattern.
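
If you do go the normalizePath() route, the substitution itself could be
quite simple. Here's a rough sketch of the idea (again plain C++ with
std::regex rather than htdig's actual URL class, so treat the function
name as hypothetical): truncate "p" to its first value, so that all
40,320 permutations of a page collapse to one canonical URL:

    // Rough sketch of the normalisation idea, NOT htdig's URL class:
    // keep only the first "N:" value of the "p" parameter, so every
    // permutation of the menu state maps to the same canonical URL.
    #include <iostream>
    #include <regex>
    #include <string>

    std::string normalize_p(const std::string& url) {
        // Capture "p=<digits>:" and drop the rest of the value, up to
        // the next '&' or '#' (or the end of the URL).
        static const std::regex multi_p("([&?]p=[0-9]+:)[^&#]*");
        return std::regex_replace(url, multi_p, "$1");
    }

    int main() {
        std::cout << normalize_p(
                         "http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:")
                  << "\n";   // prints ...&p=2:
    }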

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


