
Subject: Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu May 11 2000 - 12:38:58 PDT


According to Peter L. Peres:
> On Thu, 11 May 2000, Gilles Detillieux wrote:
> >indexing to /doc/javadoc, though, why not just set limit_urls_to
> >to http://my.host.domain/doc/javadoc and not add anything else to
...
> Their documentation part is linked to from a large page under
> DocumentRoot. I want htdig to start there and follow those links, and NOT
> climb up to the source and binaries and other things in the packages,

OK, I now have a much clearer picture in my mind of why you ran into the
problems you reported, and why the solution you came up with does make
sense in this situation. It was hard to get a view of the big picture
from one tiny snapshot at a time.

> >I think that sort of pathname optimisation for duplicate elimination
> >would be an excellent idea. I'm just not sure how you'd implement it.
> >The 3.1.x series uses a different database structure, so anything you'd
> >work out for it wouldn't transport easily to 3.2. As 3.1.x is strictly in
> >maintenance mode now, I'd recommend that if you're serious about this, you
> >implement it for 3.2 only. There, the db.docdb database has a field for
> >each file that contains the URL, so it should be possible to change that
> >field. The tricky part is that there is also the db.docs.index database,
> >which maps URLs to DocIDs, which are used as keys for the db.docdb. So,
> >you'd need to remove the URL entry for the old URL in db.docs.index,
> >and add a new entry to map the preferred URL to the existing DocID.
>
> Thanks for the info, I will look at it. I see that the problem is complex.
> I propose to implement it in two steps. The first step will insert a 'new'
> document as before, and compute its md5sum or whatnot, and store it.
> Whenever a document is indexed its 'was seen' condition will be tested,
> and if it was seen, then the new path will be dropped without insertion.
> This may keep the longer paths, but who cares (for now).

As Geoff pointed out, htdig will reach the documents with the lowest hop
count first (at least in 3.2), so it may not be that bad.
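To make the earlier db.docs.index / db.docdb remapping concrete, here is a
minimal Python sketch of the idea, using plain dicts as stand-ins for the two
databases (the dict names and record fields are illustrative, not htdig's
actual API):

```python
# Toy model of the 3.2 databases discussed above (names are illustrative):
# docs_index maps URL -> DocID; docdb maps DocID -> document record.
docs_index = {"http://host/pkg/doc/index.html": 42}
docdb = {42: {"url": "http://host/pkg/doc/index.html", "title": "Docs"}}

def remap_url(old_url, preferred_url):
    """Point a preferred URL at an existing DocID and drop the old entry."""
    doc_id = docs_index.pop(old_url)      # remove the old URL key
    docs_index[preferred_url] = doc_id    # map the preferred URL to the same DocID
    docdb[doc_id]["url"] = preferred_url  # update the URL field in the record
    return doc_id

remap_url("http://host/pkg/doc/index.html", "http://host/doc/index.html")
```

The point is that only the URL-to-DocID mapping and the stored URL field
change; the DocID, and therefore all the word-index entries keyed on it,
stay untouched.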

> Incidentally,
> the host name IS a part of the stored canonical href so there should be no
> problem with pages duplicated across hosts, if the 'host aliases' are set
> up correctly in htdig.conf. Each page duplicated (no matter how) will pass
> the test. Which brings me to the next question:

Actually, it might make sense to have (optionally) cross-host duplicate
elimination too. This would be implemented by using only the MD5 sum
as a lookup key, rather than a server name and MD5 sum combination.
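A quick sketch of that key choice, in Python for brevity (the function and
flag names are made up for illustration):

```python
import hashlib

def dedup_key(server, content, cross_host=False):
    """Build the duplicate-detection key: the MD5 sum alone for
    cross-host elimination, or (server, MD5) to keep identical
    pages on different hosts distinct."""
    digest = hashlib.md5(content).hexdigest()
    return digest if cross_host else (server, digest)

body = b"<html>same page</html>"
# Per-host keys differ for the same content on two hosts...
assert dedup_key("a.example.com", body) != dedup_key("b.example.com", body)
# ...but collapse to one key when cross-host elimination is enabled.
assert dedup_key("a.example.com", body, cross_host=True) == \
       dedup_key("b.example.com", body, cross_host=True)
```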

> If at present found words are inserted directly into the database, then to
> make my new patch work, I think that it would be necessary to make a
> separate db and insert the new words into it, for each file, followed by
> rejection of the file, if the file is 'known', or insertion into the main
> word database, if it is not known. I seem to remember that htdig reads the
> whole file in before processing. If this is so, then I need not bother
> about this imho, as I can checksum it while reading, and reject before
> parsing. Can this be done ?

Yes, htdig reads the entire file content into a large string before
parsing, so you could compute the checksum either while reading (in which
case the sum would need to be calculated in every transport method that
reads files), or between reading and parsing. You definitely want to
check the sum before parsing, both to avoid having to remove words you've
already indexed and to avoid all that extra processing for nothing.
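The check-before-parse flow can be sketched like this (a toy model, not
htdig code; `parse` stands in for whatever indexes the words):

```python
import hashlib

seen_sums = set()  # checksums of documents already indexed

def index_document(content, parse):
    """Checksum the full retrieved content and reject duplicates before
    parsing, so no words are ever inserted for a rejected copy."""
    digest = hashlib.md5(content).hexdigest()
    if digest in seen_sums:
        return False          # duplicate: drop it without parsing
    seen_sums.add(digest)
    parse(content)            # only previously unseen documents reach the parser
    return True
```

Because the rejection happens before `parse` runs, there is never a need
for the separate per-file word database the quoted message worries about.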

> Another question: is the HTTP page header indexed or not? Because I need
> to md5sum only the page, sans base URL and http header info, for obvious
> reasons. Pages with any kind of dynamic content will hopelessly break this
> test (an access counter with inline output embedded in a page comes to my
> mind as an example).

htdig does hold on to some HTTP header info, but this is separate from
the file contents. Of course, the md5sum would be only on the contents
and not on the headers. This could potentially lead to an incorrect
modification time for a document if you index a more recent, but identical
copy of a document before reaching the original, but that's an issue of
poor web site management, and not a big concern for htdig. For hard or
symbolic links, even the header information (at least content-length
and last-modified date) will be identical.
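Since htdig already keeps headers separate from the body, hashing only the
content is straightforward; here is a sketch of the principle on a raw HTTP
response (the splitting on the blank line is only for illustration):

```python
import hashlib

def content_sum(raw_response):
    """Hash only the entity body, not the HTTP headers, since headers
    (Date, Server, etc.) can differ even for identical copies of a page."""
    headers, _, body = raw_response.partition(b"\r\n\r\n")
    return hashlib.md5(body).hexdigest()

r1 = b"HTTP/1.0 200 OK\r\nDate: Thu, 11 May 2000\r\n\r\n<html>doc</html>"
r2 = b"HTTP/1.0 200 OK\r\nDate: Fri, 12 May 2000\r\n\r\n<html>doc</html>"
assert content_sum(r1) == content_sum(r2)  # differing dates don't matter
```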

> The second step will be to augment the code of the first step, with a
> 'replace-href-in-db-if-lower-hop-count' rule, that will apply to 3.2 code
> (I intend to do the first step on 3.1.5).
>
> So, do you have any pointers on this ?

Not beyond what I said before. Geoff seemed to suggest this isn't very
feasible, and I'd trust his judgement on this issue. As he said, 3.2
should reach the lowest hop count path first, but that may differ from
the lowest nested directory count. This may be too much effort for
too little gain, as you may find that the first URL encountered is fine
in almost all cases.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-unsubscribe@htdig.org. You will receive a message to confirm this.



This archive was generated by hypermail 2b28 : Thu May 11 2000 - 10:26:32 PDT