htdig: Partial success :) Re: Rewriting URL's in db.docs.index possible?


Doug (DougB@simplenet.com)
Thu, 14 Jan 1999 18:03:23 -0800


Doug wrote:

> For
> a variety of reasons, after you access our site for the first time you
> are redirected from "www.simplenet.com" to "www1.simplenet.com." Part of
> the reason for this is to append a tracking number, like
> ...html?000.lotsmorenumbershere. This isn't the end of the world,
> however what we'd really like to do is rewrite indexed url's of the form
> "http://www1.simplenet.com/path/file.html?" to
> "http://www.simplenet.com/path/file.html".

        Ok, I'm partway home on this. The attached patch cuts every URL off at
the first '?', so the ? and anything after it are dropped. I include it in
case anyone else is dealing with a similar problem. Perhaps some
enterprising individual might want to make a configuration option thinger
for "stuff_to_strip_off_the_end_of_URLs" if they were bored. :)
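For anyone who wants to play with the "stuff to strip off the end" idea
outside of htdig first, here is a minimal standalone sketch. It is not htdig
code: the function name strip_url_suffix and the "delimiters" parameter are
made up for illustration, standing in for whatever a real configuration
option would supply.

```cpp
#include <string>

// Hypothetical helper (not part of htdig): return `url` cut off at the first
// character that appears in `delimiters`. With the default of "?", this
// mirrors what the patch does: the '?' and everything after it is dropped.
std::string strip_url_suffix(const std::string &url,
                             const std::string &delimiters = "?")
{
    // Find the first occurrence of any delimiter character.
    std::string::size_type cut = url.find_first_of(delimiters);

    // No delimiter present: leave the URL untouched.
    if (cut == std::string::npos)
        return url;

    // Keep only the part before the delimiter.
    return url.substr(0, cut);
}
```

Passing "?#" as the delimiters would strip anchors as well, which is roughly
what URL.cc ends up doing once the patch is applied.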

        My (very imperfect) reading of the code indicates that I don't actually
want to change the "www1" to "www" before it goes into the db, because if
I do, htdig will think that it always has to look up URLs that start with
www1, even if it has already seen them before. Soooo... I'm back to changing
the db AFTER it's written out (but before I run htmerge?). I'm still
open to suggestions on that bit.
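
The www1-to-www rewrite itself is just a prefix swap on each URL string. A
minimal sketch of that one step, assuming the URLs can be pulled out of the
db and put back as plain strings (that part is not shown, and I have not
worked out how to do it against db.docs.index). The function name
rewrite_url_prefix is made up for illustration.

```cpp
#include <string>

// Hypothetical helper (not part of htdig): if `url` starts with
// `from_prefix`, return it with that prefix replaced by `to_prefix`.
// Any URL that does not start with `from_prefix` is returned unchanged.
std::string rewrite_url_prefix(const std::string &url,
                               const std::string &from_prefix,
                               const std::string &to_prefix)
{
    // compare() returns 0 only when the first from_prefix.size()
    // characters of url are exactly from_prefix.
    if (url.compare(0, from_prefix.size(), from_prefix) != 0)
        return url;

    // Glue the new prefix onto whatever followed the old one.
    return to_prefix + url.substr(from_prefix.size());
}
```

Called with "http://www1.simplenet.com/" and "http://www.simplenet.com/" as the
two prefixes, this leaves URLs on other hosts alone.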

Thanks a bunch for all the previous help/suggestions,

Doug

*** htdig-3.1.0b4/htlib/URL.cc Thu Dec 24 09:20:20 1998
--- htdig-3.1.0b4-no-quest/htlib/URL.cc Thu Jan 14 17:03:44 1999
***************
*** 137,142 ****
--- 137,145 ----
      // Thanks goes to David Filiatrault <dwf@WebThreads.Com> for suggesting
      // this removal process.
      //
+
+ char *stupid_question_mark = strchr(ref, '?');
+
      char *anchor = strchr(ref, '#');
      char *params = strchr(ref, '?');
      if (anchor)
***************
*** 155,160 ****
--- 158,171 ----
          }
      }
  
+
+ // Well, if that works for anchors will it work for the stinkin' ?'s
+ if (stupid_question_mark)
+ {
+ *stupid_question_mark = '\0';
+ }
+
+
      //
      // If, after the removal of a possible '#' we have nothing left,
      // we just want to use the base URL.
***************
*** 277,285 ****
--- 288,303 ----
      // Ignore any part of the URL that follows the '#' since this is just
      // an index into a document.
      //
+
+ char *stupid_question_mark = strchr(nurl, '?');
      char *p = strchr(nurl, '#');
      if (p)
          *p = '\0';
+
+ if (stupid_question_mark)
+ {
+ *stupid_question_mark = '\0';
+ }
          
      //
      // Extract the service



