Re: htdig: Duplicate files with unique URLs


Andrew Scherpbier (andrew@contigo.com)
Wed, 10 Dec 1997 16:28:27 -0800


Geoff Hutchison wrote:
>
> >I think that should be done in htlib/URL.cc. There is already some code to
> >deal with "/../" in URLs. Removal of "//" shouldn't be too hard.
>
> Wasn't there also talk of keeping a checksum or something for each file and
> only keeping one copy? In this particular situation this seems very easy,
> but I can think of plenty of other situations where it's not so easy to
> detect a duplicate file with a unique url.

Yes. That would be a very good solution.
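
For illustration, the checksum idea could be sketched roughly as follows. This is a minimal sketch, not code from ht://Dig: the names fnv1a and DuplicateIndex are hypothetical, and FNV-1a is just one convenient non-cryptographic hash picked for the example.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// 64-bit FNV-1a hash over the document body (assumption: any stable
// checksum would serve; FNV-1a is used here only because it is short).
std::uint64_t fnv1a(const std::string &data)
{
    std::uint64_t hash = 14695981039346656037ULL;   // FNV offset basis
    for (unsigned char c : data)
    {
        hash ^= c;
        hash *= 1099511628211ULL;                   // FNV prime
    }
    return hash;
}

// Hypothetical index that remembers the first URL seen for each checksum.
struct DuplicateIndex
{
    std::unordered_map<std::uint64_t, std::string> firstUrlByHash;

    // Returns the URL that first carried this content, or an empty
    // string if the content has not been seen before (it is then recorded).
    // Note: hashes can collide, so a real indexer would want to confirm a
    // match by comparing the documents themselves.
    std::string check(const std::string &url, const std::string &body)
    {
        std::pair<std::unordered_map<std::uint64_t, std::string>::iterator, bool>
            result = firstUrlByHash.emplace(fnv1a(body), url);
        return result.second ? std::string() : result.first->second;
    }
};
```

With something like this, a second URL that returns the same body would be reported as a duplicate of the first one, no matter how the two URLs differ.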

> While we're at it, though, I suggest some way of stripping off strings like
> "?D-A" used by Apache's new (1.3) directory sorting feature. I think this
> would probably go in htlib/URL.cc as well.

I thought I was already stripping out URL parameters after a "?"...
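
As a rough illustration of what stripping everything after a "?" would mean, here is a minimal sketch using std::string. The function name stripQuery is hypothetical and this is not the code in htlib/URL.cc. Note that it drops every query string, which may be more aggressive than removing only Apache's sorting strings such as "?D-A".

```cpp
#include <string>

// Hypothetical helper: return the URL with everything from the first '?'
// onward removed.  URLs without a '?' are returned unchanged.
std::string stripQuery(const std::string &url)
{
    std::string::size_type q = url.find('?');
    return (q == std::string::npos) ? url : url.substr(0, q);
}
```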

Anyway, attached is a diff against htlib/URL.{cc,h} that will remove the
double slashes from the path. I tested it briefly and it seemed to work, but
please test this some more!

-- 
Andrew Scherpbier <andrew@contigo.com>
Contigo Software <http://www.contigo.com/>

Index: htlib/URL.cc
===================================================================
RCS file: /opt/src/cvs/htdig/htlib/URL.cc,v
retrieving revision 1.5
diff -c -r1.5 URL.cc
*** URL.cc	1997/07/07 21:23:43	1.5
--- URL.cc	1997/12/11 00:24:45
***************
*** 214,250 ****
  	    //
  	}
      }
!     //
!     // We now need to take care of situations where the URL contains
!     // relative parts ("/../")
!     // We will rewrite the path to be the minimal.
!     //
!     int	i, limit;
!     while ((i = _path.indexOf("/../")) >= 0)
!     {
! 	if ((limit = _path.lastIndexOf('/', i - 1)) >= 0)
! 	{
! 	    String	newPath;
! 	    newPath << _path.sub(0, limit).get();
! 	    newPath << _path.sub(i + 3).get();
! 	    _path = newPath;
! 	}
! 	else
! 	{
! 	    _path = _path.sub(i + 3).get();
! 	}
!     }
!     //
!     // Also get rid of redundent "/./".  This could cause infinite
!     // loops.
!     //
!     while ((i = _path.indexOf("/./")) >= 0)
!     {
! 	String	newPath;
! 	newPath << _path.sub(0, i).get();
! 	newPath << _path.sub(i + 2).get();
! 	_path = newPath;
!     }
      }
  }
  
--- 214,224 ----
  	    //
  	}
      }
! 
!     //
!     // Get rid of loop-causing constructs in the path
!     //
!     normalizePath();
      }
  }
  
***************
*** 334,340 ****
      //
      _path = "/";
      _path << strtok(0, "\n");
! 
      //
      // Build the url.  (Note, the host name has NOT been normalized!)
      //
--- 308,319 ----
      //
      _path = "/";
      _path << strtok(0, "\n");
! 
!     //
!     // Get rid of loop-causing constructs in the path
!     //
!     normalizePath();
! 
      //
      // Build the url.  (Note, the host name has NOT been normalized!)
      //
***************
*** 345,350 ****
--- 324,379 ----
  
      _url << _path;
  }
+ 
+ //*****************************************************************************
+ // void URL::normalizePath()
+ //
+ void URL::normalizePath()
+ {
+     //
+     // We now need to take care of situations where the URL contains
+     // relative parts ("/../")
+     // We will rewrite the path to be the minimal.
+     //
+     int	i, limit;
+     while ((i = _path.indexOf("/../")) >= 0)
+     {
+ 	if ((limit = _path.lastIndexOf('/', i - 1)) >= 0)
+ 	{
+ 	    String	newPath;
+ 	    newPath << _path.sub(0, limit).get();
+ 	    newPath << _path.sub(i + 3).get();
+ 	    _path = newPath;
+ 	}
+ 	else
+ 	{
+ 	    _path = _path.sub(i + 3).get();
+ 	}
+     }
+ 
+     //
+     // Also get rid of redundent "/./".  This could cause infinite
+     // loops.
+     //
+     while ((i = _path.indexOf("/./")) >= 0)
+     {
+ 	String	newPath;
+ 	newPath << _path.sub(0, i).get();
+ 	newPath << _path.sub(i + 2).get();
+ 	_path = newPath;
+     }
+ 
+     //
+     // Furthermore, get rid of "//".  This could also cause loops
+     //
+     while ((i = _path.indexOf("//")) >= 0)
+     {
+ 	String	newPath;
+ 	newPath << _path.sub(0, i).get();
+ 	newPath << _path.sub(i + 1).get();
+ 	_path = newPath;
+     }
+ }
  
  //*****************************************************************************
  // void URL::dump()
Index: htlib/URL.h
===================================================================
RCS file: /opt/src/cvs/htdig/htlib/URL.h,v
retrieving revision 1.2
diff -c -r1.2 URL.h
*** URL.h	1997/03/24 04:33:22	1.2
--- URL.h	1997/12/11 00:24:45
***************
*** 57,62 ****
--- 57,63 ----
      String	_signature;
  
      void	removeIndex(String &);
+     void	normalizePath();
  };
  
  
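
For anyone who wants to try the normalization logic outside of ht://Dig, the three loops in the patch can be approximated with std::string. This is only a sketch under that assumption: normalizePathSketch is a hypothetical name, and std::string::find, rfind, substr, and erase stand in for the String class methods used in the diff.

```cpp
#include <string>

// Hypothetical standalone analogue of URL::normalizePath() from the patch,
// written against std::string rather than ht://Dig's String class.
std::string normalizePathSketch(std::string path)
{
    std::string::size_type i;

    // Collapse "/../" together with the path segment that precedes it.
    while ((i = path.find("/../")) != std::string::npos)
    {
        // Find the slash that starts the previous segment, if there is one.
        std::string::size_type limit =
            (i == 0) ? std::string::npos : path.rfind('/', i - 1);
        if (limit != std::string::npos)
            path = path.substr(0, limit) + path.substr(i + 3);
        else
            path = path.substr(i + 3);
    }

    // Drop redundant "/./" segments by removing the leading "/.".
    while ((i = path.find("/./")) != std::string::npos)
        path.erase(i, 2);

    // Collapse "//" into a single "/".
    while ((i = path.find("//")) != std::string::npos)
        path.erase(i, 1);

    return path;
}
```

Each loop iteration shortens the string, so all three loops terminate. For example, "/a/b/../c" becomes "/a/c" and "/a//b///c" becomes "/a/b/c".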
