Re: htdig: Digging both internal and external sites


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 3 Dec 1998 15:57:13 -0600 (CST)


According to Frank Guangxin Liu:
>
> What I did is
> 1) setup an internal proxy/cache server using "squid".
> 2) configure "squid" so that it connects directly to Intranet hosts
> and uses firewall proxy for Internet hosts.
> 3) tell htdig to use "squid" for all hosts.

I hadn't thought of that! I couldn't resist the challenge, though, and as
it's a quiet day I dove in and came up with my own patch, included below.
It's to be applied to the 3.1.0b2 sources, and includes Geoff's fix
from Nov. 8 for not deleting the proxy name prematurely. With a bit
of fiddling, it could also be applied to the current CVS source tree,
I imagine, but I haven't grabbed a copy of it yet. Probably deleting
the first couple sections from the Document.cc diffs below would do it,
unless this file has seen a lot of recent changes.

Denis, give this a try, with a line in your htdig.conf that specifies
what to exclude, e.g.:

http_proxy_exclude: http://my.intranet.com/ \
        http://our.other.intranet.com/

It uses a case-insensitive substring match, so you could even just specify
a domain, if that domain isn't going to show up anywhere else in URLs that
should go through the proxy. E.g. ".intranet.com/" would exclude anything
that starts "http://SOMETHING.intranet.com/...", but also something like
"http://outside.our.net/blah/stuff/from/other.intranet.com/hello.html".
In other words, it works like the exclude_urls parameter.

Here's the patch. I tested the logic of UseProxy(), but I didn't have
a proxy server to test everything thoroughly, so let me know if it works.

--- htdig-3.1.0b2/htcommon/defaults.cc.proxy Mon Nov 2 18:21:51 1998
+++ htdig-3.1.0b2/htcommon/defaults.cc Thu Dec 3 13:42:13 1998
@@ -119,6 +119,7 @@
     {"heading_factor_6", "0"},
     {"htnotify_sender", "webmaster@www"},
     {"http_proxy", ""},
+ {"http_proxy_exclude", ""},
     {"image_list", "${database_base}.images"},
     {"iso_8601", "false"},
     {"keywords_factor", "100"},
--- htdig-3.1.0b2/htdig/Document.h.proxy Mon Nov 2 18:21:51 1998
+++ htdig-3.1.0b2/htdig/Document.h Thu Dec 3 15:45:05 1998
@@ -132,6 +132,7 @@
 
     int readHeader(Connection &);
     time_t getdate(char *datestring);
+ int UseProxy();
 };
 
 #endif
--- htdig-3.1.0b2/htdig/Document.cc.proxy Mon Nov 2 18:21:51 1998
+++ htdig-3.1.0b2/htdig/Document.cc Thu Dec 3 15:45:12 1998
@@ -167,9 +167,6 @@
     if (!url)
       delete url;
     url = 0;
- if (!proxy)
- delete proxy;
- proxy = 0;
     referer = 0;
     if(config.Boolean("modification_time_is_now"))
        modtime = time(NULL);
@@ -182,6 +179,10 @@
 
     // Don't reset the authorization since it's a pain to set up again.
     // authorization = 0;
+ // Don't reset the proxy since it's a pain to set up too.
+ // if (!proxy)
+ // delete proxy;
+ // proxy = 0;
 }
 
 
@@ -305,6 +306,43 @@
 
 
 //*****************************************************************************
+// int Document::UseProxy()
+// Returns 1 if the given url is to be retrieved from the proxy server,
+// or 0 if it's not.
+//
+int
+Document::UseProxy()
+{
+ static StringMatch *excludeproxy = 0;
+
+ //
+ // Initialize excludeproxy list if this is the first time.
+ //
+ if (!excludeproxy)
+ {
+ excludeproxy = new StringMatch();
+
+ String t = config["http_proxy_exclude"];
+ String pattern;
+ char *p = strtok(t, " \t");
+ while (p)
+ {
+ if (pattern.length())
+ pattern << '|';
+ pattern << p;
+ p = strtok(0, " \t");
+ }
+ excludeproxy->IgnoreCase();
+ excludeproxy->Pattern(pattern);
+ }
+
+ if (proxy && excludeproxy->FindFirst(url->get()) < 0)
+ return 1;
+ return 0;
+}
+
+
+//*****************************************************************************
 // DocStatus Document::RetrieveHTTP(time_t date)
 // Attempt to retrieve the document pointed to by our internal URL
 //
@@ -315,7 +353,8 @@
     if (c.open() == NOTOK)
         return Document_not_found;
 
- if (proxy)
+ int useproxy = UseProxy();
+ if (useproxy)
     {
         if (c.assign_port(proxy->port()) == NOTOK)
             return Document_not_found;
@@ -335,6 +374,10 @@
         if (debug > 1)
         {
             cout << "Unable to build connection with " << url->host() << ':' << url->port() << endl;
+ if (useproxy)
+ {
+ cout << "(Via proxy " << proxy->host() << ':' << proxy->port() << ')' << endl;
+ }
         }
         return Document_no_server;
     }
@@ -344,7 +387,7 @@
     //
     String command = "GET ";
 
- if (proxy)
+ if (useproxy)
     {
         command << url->get() << " HTTP/1.0\r\n";
     }

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-request@sdsu.edu containing the single word "unsubscribe" in
the body of the message.



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:29:46 PST