[htdig] [PATCH] fix valid_extensions handling bugs


Subject: [htdig] [PATCH] fix valid_extensions handling bugs
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Feb 01 2000 - 07:19:30 PST


According to fx:
> it s very strange ...
> I show you my conf
> -------------------------------------------------------------
> database_dir: /home/web/inerd/htdig/db
> database_base: ${database_dir}/inerd
> #allow_virtual_hosts: true
> valid_extensions: .html .htm .shtml .php .php3 .asp .php
> start_url: http://192.168.0.2
> limit_urls_to: http://192.168.0.2
> exclude_urls: /cgi-bin/ .cgi
> bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif\
> .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
> maintainer: inerd
> max_head_length: 10000
> max_doc_size: 200000
> no_excerpt_show_top: false
> search_algorithm: exact:1 synonyms:0.5 endings:0.1
> search_results_wrapper: /home/web/inerd/www/htdig/wrapper_inerd.html
> nothing_found_file: /home/web/inerd/www/htdig/nomatch_inerd.html
> ----------------------------------------------------
> the result of the htdig -i -vvv
>
> ...
> pushing http://192.168.0.2/index.php3
> +A tag: pos = 2, position = =/news/index.php3?idnews=3 class=news>
> href: http://192.168.0.2/news/index.php3?idnews=3 (La troisième)
>
> Rejected: Extension is not valid!

This error, like the one below, means the URL was rejected because its
extension doesn't match any entry in valid_extensions. Unfortunately,
the extension matching doesn't strip CGI parameters (everything from
the "?" onward) before comparing, so the match fails. I think this is
a bug, which the patch below should fix.
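
To illustrate, here is a small standalone sketch of what the lookup sees
for such a URL. It is not the actual htdig code: the helper name
last_dot_suffix is made up for this example, and it uses std::string
rather than htdig's own String class.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper, not part of htdig: returns everything from the
// last '.' in the URL onward, which is what strrchr(url, '.') yields in
// Retriever::IsValidURL. Returns an empty string if there is no dot.
std::string last_dot_suffix(const char *url)
{
    const char *ext = std::strrchr(url, '.');
    return ext ? std::string(ext) : std::string();
}
```

For http://192.168.0.2/news/index.php3?idnews=3 this yields
".php3?idnews=3" rather than ".php3", so it can't match anything listed
in valid_extensions.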

> ...
>
> ...
> *A tag: pos = 2, position = ="/services" class="navig1">
> href: http://192.168.0.2/services (services)
>
> Rejected: Extension is not valid!

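In this case, the URL has no extension at all, but it is still rejected
because of a bug in the new valid_extensions attribute handling, as was
pointed out by Warren Jones about a month ago: the code takes the last
dot anywhere in the URL (here, the one in the IP address) as the start
of the extension.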
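
As a rough sketch of the condition the first part of the patch below
checks for (again not htdig code; the function name
dot_outside_final_component is an assumption made for this example):

```cpp
#include <cstring>

// Hypothetical helper, not part of htdig: true if the last '.' in the
// URL is followed by a '/', meaning the dot belongs to the host name or
// an earlier path component and not to the final component.
bool dot_outside_final_component(const char *url)
{
    const char *dot = std::strrchr(url, '.');
    return dot != nullptr && std::strchr(dot, '/') != nullptr;
}
```

For http://192.168.0.2/services the last dot starts ".2/services", so
this returns true and the "extension" should be ignored. For
http://192.168.0.2/index.php3 it returns false.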

> ...
>
> do you have any suggestion ?
> (I ve really tried a lot of things ... a real mystery)
>
> thanx
>
> ps : I use 3.1.4
> and my directory index is good :
> DirectoryIndex index.html index.htm index.shtml index.cgi index.php3

Here is a patch which I hope will fix both problems. Please let me know
if it works.

--- htdig/Retriever.cc.valextbug Thu Dec 9 18:28:44 1999
+++ htdig/Retriever.cc Tue Feb 1 09:16:04 2000
@@ -702,9 +702,14 @@ Retriever::IsValidURL(char *u)
     //
     char *ext = strrchr(url, '.');
     String lowerext;
+    if (ext && strchr(ext, '/'))    // Ignore a dot if it's not in the
+        ext = NULL;                 // final component of the path.
     if (ext)
       {
         lowerext = ext;
+        int parm = lowerext.indexOf('?');    // chop off URL parameter
+        if (parm >= 0)
+            lowerext.chop(lowerext.length() - parm);
         lowerext.lowercase();
         if (invalids->Exists(lowerext))
           {
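
If you want to try the logic outside of htdig, the following is a
minimal standalone sketch of the extension extraction as it should
behave after the patch. It is only an approximation: url_extension is a
hypothetical name, and std::string stands in for htdig's String class
(find/erase replace indexOf/chop).

```cpp
#include <algorithm>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical standalone version of the patched extension extraction.
// Returns the lowercased extension (including the dot), or an empty
// string if the final path component has no extension.
std::string url_extension(const char *url)
{
    const char *ext = std::strrchr(url, '.');
    // Ignore a dot that is not in the final component of the path.
    if (ext && std::strchr(ext, '/'))
        ext = nullptr;
    if (!ext)
        return std::string();

    std::string lowerext(ext);
    // Chop off any URL parameter.
    std::string::size_type parm = lowerext.find('?');
    if (parm != std::string::npos)
        lowerext.erase(parm);
    // Lowercase before comparing against the extension lists.
    std::transform(lowerext.begin(), lowerext.end(), lowerext.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return lowerext;
}
```

With this, http://192.168.0.2/news/index.php3?idnews=3 gives ".php3",
and http://192.168.0.2/services gives an empty string, so neither URL
from the log above would fail the valid_extensions test.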

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Tue Feb 01 2000 - 07:21:16 PST