
Subject: Re: [htdig] htdig
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Jan 11 2001 - 09:13:24 PST


According to Geoff Hutchison:
> No regular expressions needed. You can limit URLs based on query patterns
> already. See the bad_querystr attribute:
> <http://www.htdig.org/attrs.html#bad_querystr>
...
> On Thu, 11 Jan 2001, Richard Bethany wrote:
> > I'm the SysAdmin for our web servers and I'm working with Chuck (who does
> > the development work) on this problem. Here's the "nuts & bolts" of the
> > problem. Our entire web server is set up with a menuing system being run
> > through PHP3. This menuing system basically allows local documents/links to
> > be reached via a URL off of the PHP3 file. In other words, if I try to
> > access a particular page it will be accessed as
> > http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:.
> >
> > In this scenario the only relevant piece of info is the "i" value; the
> > remainder of the info simply describes which portions of the menu should be
> > displayed. What ends up happening is that, for a page with eight(8) main
> > menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits" show up in htDig for
> > each link!! I essentially need to exclude any URL where "p" has more than
> > one value (i.e. - &p=1: is okay, &p=1:2: is not).
> >
> > I've looked through the mailing list archives and found a great deal of
> > discussion on the topic of regular expressions with exclusions and also some
> > talk of stripping parts of the URL, but I've seen nothing to indicate that
> > any of this has actually been implemented. Do you know if there is any
> > implementation of this? If not, I saw a reply to a different problem from
> > Gilles indicating that the URL::normalizePath() function would be the best
> > place to start hacking so I guess I'll try that.

The catch, though, is that without regular expressions you may end up
with a large list of possible values that need to be specified
explicitly. The same problem exists for exclude_urls as for bad_querystr,
as they're handled essentially the same way, the only difference being
that bad_querystr is limited to patterns occurring on or after the last
"?" in the URL.

So, if &p=1: is valid, but &p=[2-9].* and &p=1:[2-9].* are not, then
the explicit list in bad_querystr would need to be:

bad_querystr: &p=2 &p=3 &p=4 &p=5 &p=6 &p=7 &p=8 &p=9 \
                &p=1:2 &p=1:3 &p=1:4 &p=1:5 &p=1:6 &p=1:7 &p=1:8 &p=1:9

It gets a bit more complicated if you need to deal with numbers of two
or more digits too, because then you can allow &p=1: but not &p=1[0-9]:,
so you'd need to include these patterns in the list too:

        &p=10 &p=11 &p=12 &p=13 &p=14 &p=15 &p=16 &p=17 &p=18 &p=19 &p=1:1
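
For anyone who would rather generate these lists than type them, here is a
minimal Python sketch. It is not part of ht://Dig; the helper names are
hypothetical, and is_excluded() only models bad_querystr as a plain substring
test on the part of the URL from the last "?" onward, which is my reading of
the behaviour described above.

```python
# Hypothetical helpers, not part of ht://Dig: build the explicit substring
# list and model the bad_querystr test as a plain substring match.

def build_bad_querystr_patterns():
    patterns = [f"&p={d}" for d in range(2, 10)]      # &p=2 .. &p=9
    patterns += [f"&p=1:{d}" for d in range(2, 10)]   # &p=1:2 .. &p=1:9
    patterns += [f"&p=1{d}" for d in range(0, 10)]    # &p=10 .. &p=19
    patterns.append("&p=1:1")                         # &p=1:1, &p=1:10, ...
    return patterns

def is_excluded(url, patterns):
    # Assumption: only the text from the last "?" onward is examined.
    pos = url.rfind("?")
    query = url[pos:] if pos >= 0 else ""
    return any(p in query for p in patterns)

patterns = build_bad_querystr_patterns()
# Print a single line that could be pasted into a config file.
print("bad_querystr: " + " ".join(patterns))
```

With this list, a URL ending in &p=1: is kept, while one ending in &p=2:3:4:
or &p=12: is rejected.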

So, while it's not pretty, it is feasible provided the range of
possibilities doesn't get overly complex. This will be easier in 3.2,
which will allow regular expressions.
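
To show how much shorter this gets with regular expressions, here is a
sketch using Python's re module. It is only an illustration of the idea:
the pattern is my own condensation of the explicit lists above, and it says
nothing about the exact syntax or attribute that 3.2 will accept.

```python
import re

# Assumption: one pattern covering the same cases as the explicit lists
# (&p=2..9, &p=10..19, and &p=1:1..9). Illustrative only, not ht://Dig syntax.
BAD_P = re.compile(r"&p=([2-9]|1[0-9]|1:[1-9])")

def is_excluded(url):
    # As with bad_querystr, look only at the text from the last "?" onward.
    pos = url.rfind("?")
    query = url[pos:] if pos >= 0 else ""
    return BAD_P.search(query) is not None
```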

I think my suggestion for hacking URL::normalizePath() involved much more
complicated patterns, and search-and-replace style substitutions based
on those patterns. That may still be the way to go if you want to do
normalisations of patterns rather than simple exclusions, e.g. if you're
not guaranteed to hit a link to each page using a non-excluded pattern.
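
As a rough idea of what such a normalisation would do, here is a Python
sketch. It is not the C++ code in URL::normalizePath(), and it assumes, as
Richard describes, that "i" is the only query parameter that selects the
page; the KEEP set and the normalize() name are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: only the "i" parameter identifies the document; the others
# ("e", "p") merely describe which parts of the menu are expanded.
KEEP = {"i"}

def normalize(url):
    parts = urlsplit(url)
    # Keep only the parameters that matter, preserving their order.
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in KEEP]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Every menu variation of a page then collapses to the same URL, so the page
is indexed once no matter which link the digger happens to follow first.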

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Thu Jan 11 2001 - 09:27:19 PST