Subject: Re: [htdig3-dev] Stripping out all CGI query strings.
From: Patrick (email@example.com)
Date: Sun Mar 12 2000 - 00:36:49 PST
The goal is not to EXCLUDE the entire document from being indexed;
rather, just the query string (anything after the ?). I will take
your advice and look into URL::URL() and URL::parse().
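
To make the intent concrete, here is a minimal sketch of the stripping step I have in mind. It is only an illustration under stated assumptions: it uses std::string rather than ht://Dig's own String class, and both the helper name strip_query and the boolean flag standing in for a config setting are hypothetical, not existing ht://Dig code or attributes.

```cpp
#include <string>

// Hypothetical helper (not part of ht://Dig): returns the URL with
// everything from the first '?' onward removed, like s/\?.*$// in Perl.
// The 'enabled' flag stands in for an assumed config setting; when it
// is false the URL is returned unchanged.
std::string strip_query(const std::string &url, bool enabled = true)
{
    if (!enabled)
        return url;

    // Locate the start of the query string, if there is one.
    std::string::size_type pos = url.find('?');
    if (pos == std::string::npos)
        return url;             // no query string, nothing to strip

    // Keep only the part before the '?'.
    return url.substr(0, pos);
}
```

Anything after the '?' is dropped, including a trailing "#sectionname", so the document itself is still indexed, just without its query string.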
At 12:30 PM 3/6/00 -0600, Gilles Detillieux wrote:
>According to Patrick:
>> Could someone give me some insight as to where I can begin
>> to write a patch that will allow the ability to "remove all
>> query string (anything after the '?') variables"?
>> My initial guess is within Retriever.cc, in the Retriever::Initial
>> function, immediately after:
>> url = u.get();
>> ..then, if a certain config setting is true, perform something
>> similar to the Perl equivalent of:
>> url =~ s/\?.*$//;
>> Any help is appreciated.
>Retriever::Initial only handles the initial URLs, i.e. in start_url
>or URLs already in the database for an update htdig. It won't handle
>newly followed href's. To get them all, maybe URL.cc is the best place
>for this. It already strips off the "#sectionname" portion of a URL,
>in URL::URL() and URL::parse().
>You may want to take a step back, though, and ask yourself why you want to
>do this. If your goal is simply to avoid indexing any URL with a query
>string, you can just add a ? to the exclude_urls attribute definition
>in your htdig.conf. Stripping off the query string is a pretty drastic
>step, as you'll still end up indexing all your CGI scripts (unless
>excluded by exclude_urls), but calling them all without a query string.
>It will also prevent you from being able to index any "virtual tree"
>of documents accessed by a query string, if you ever need to do this.
>Gilles R. Detillieux E-mail: <firstname.lastname@example.org>
>Spinal Cord Research Centre WWW:
>Dept. Physiology, U. of Manitoba Phone: (204)789-3766
>Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>To unsubscribe from the htdig3-dev mailing list, send a message to
>You will receive a message to confirm this.
This archive was generated by hypermail 2b28 : Sun Mar 12 2000 - 00:42:46 PST