Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0


Subject: Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0
From: Peter L. Peres (plp@actcom.co.il)
Date: Thu May 11 2000 - 11:25:08 PDT


On Thu, 11 May 2000, Gilles Detillieux wrote:

>indexing to /doc/javadoc, though, why not just set limit_urls_to
>to http://my.host.domain/doc/javadoc and not add anything else to
>limit_urls_to, so that you wouldn't get all the other stuff you didn't
>want? If I recall from the htdig.conf you posted yesterday (or was it
>Tuesday?), you had limit_urls_to set to accept any URL that contained
>your host name.

You must understand that this was only an example I gave. In reality, I
have a lot of packages that are visible in their entirety under the
DocumentRoot, and not all of them are on the same machine, so many of them
I can reach only by http. That is why the find-based approach is a hack,
and it is 'out'. Their documentation parts are linked to from a large page
under the DocumentRoot. I want htdig to start there and follow those
links, and NOT climb up to the sources, binaries and other things in the
packages, which are also reachable from under the DocumentRoot, but only
via a link that *IS* excluded using the exclude setting in the htdig conf.

The point is that I do *not* want to bother with each and every package
and thing. I want a unified naming convention such that htdig goes only
*there* and does not climb the parts of the tree it has no business
climbing. Before my patch, htdig behaved like a backtracking algorithm and
would walk the whole allowed tree, no matter where it entered it. With the
patch it is constrained to the subtree rooted at the entry point. Note
that I may have packages that have a doc section, and others that are a
single jumble of files, so I cannot rely on having 'doc' or something like
it in the target.
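
To make the constraint concrete, here is roughly what the rule boils down
to; this is only an illustrative sketch with names I made up, not the
identifiers used in the actual patch:

    #include <string>

    // Accept a candidate href only if it lies at or below the directory
    // of the URL through which the dig entered the tree, i.e. no
    // climbing to the parent or to sibling subtrees.
    bool within_entry_subtree(const std::string &entry_url,
                              const std::string &candidate_url)
    {
        // Directory of the entry point: everything up to the last '/'.
        std::string root = entry_url.substr(0, entry_url.rfind('/') + 1);
        return candidate_url.compare(0, root.size(), root) == 0;
    }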

This I have fixed. Now the next one.

>Sure, but if you know at any indexing run what you want to index (as
>opposed to everything you want to leave out), then it's usually pretty
>easy to specify that explicitly.

Yo, that's the point! I do *not* know what I need to index, I only know
about a pointer to it! And I want that pointer followed, and only its
subtree indexed (many such subtrees), as opposed to the whole tree.

My starting points for htdig are things like: general docs, homepage,
blabla, development packages (page with links), imported packages (page
with links), documentation for this (page with links) and so on. It is
also possible to browse the whole thing using links from the homepage that
lead to the directory structure proper, which in turn leads to the
packages (f.ex. -> /usr/doc as I have described). These links will NOT be
followed by htdig (they are in the forbidden URL list of the htdig conf).
I hope that you begin to understand what I need now.

>I think that sort of pathname optimisation for duplicate elimination
>would be an excellent idea. I'm just not sure how you'd implement it.
>The 3.1.x series uses a different database structure, so anything you'd
>work out for it wouldn't transport easily to 3.2. As 3.1.x is strictly in
>maintenance mode now, I'd recommend that if you're serious about this, you
>implement it for 3.2 only. There, the db.docdb database has a field for
>each file that contains the URL, so it should be possible to change that
>field. The tricky part is that there is also the db.docs.index database,
>which maps URLs to DocIDs, which are used as keys for the db.docdb. So,
>you'd need to remove the URL entry for the old URL in db.docs.index,
>and add a new entry to map the preferred URL to the existing DocID.

Thanks for the info, I will look at it. I see that the problem is complex.
I propose to implement it in two steps. The first step will insert a 'new'
document as before, compute its md5sum or whatnot, and store it. Whenever
a document is indexed, its 'was seen' condition will be tested, and if it
has been seen, the new path will be dropped without insertion. This may
keep the longer paths, but who cares (for now). Incidentally, the host
name IS a part of the stored canonical href, so there should be no problem
with pages duplicated across hosts, provided the 'host aliases' are set up
correctly in htdig.conf. Every duplicated page (no matter how it came to
be duplicated) will be caught by the test.
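
Just to make step one concrete, the 'was seen' test I have in mind is
essentially the following; a self-contained sketch only, where a real
implementation would use md5 over the page body rather than std::hash, and
the names are mine, not htdig's:

    #include <cstddef>
    #include <functional>
    #include <set>
    #include <string>

    // Keep a set of content checksums; a document whose checksum has
    // already been seen is dropped without inserting its new path.
    static std::set<std::size_t> seen_checksums;

    bool already_seen(const std::string &page_body)
    {
        std::size_t sum = std::hash<std::string>{}(page_body);
        // insert() reports whether the value was newly added.
        return !seen_checksums.insert(sum).second;
    }

Which brings me to the next question: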

If, at present, found words are inserted directly into the database, then
to make my new patch work I think it would be necessary to make a separate
db, insert the new words into it for each file, and then either reject the
file (if the file is 'known') or merge the words into the main word
database (if it is not). However, I seem to remember that htdig reads the
whole file in before processing it. If that is so, then I need not bother
about this imho, as I can checksum the file while reading it and reject it
before parsing. Can this be done?
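
To illustrate what I mean by checksumming while reading: fold each chunk
into a running checksum as it arrives, so the duplicate test can be made
before the parser is ever invoked. A rough sketch with invented names; a
real version would keep an incremental md5 context instead of the FNV-1a
fold shown here:

    #include <cstddef>
    #include <cstdint>

    // Fold each chunk of the incoming document into a running checksum
    // so the duplicate test can be made before parsing starts.
    struct RunningChecksum {
        std::uint64_t state = 0xcbf29ce484222325ULL;    // FNV-1a offset basis

        void update(const char *buf, std::size_t len) {
            for (std::size_t i = 0; i < len; ++i) {
                state ^= static_cast<unsigned char>(buf[i]);
                state *= 0x100000001b3ULL;              // FNV-1a prime
            }
        }
        std::uint64_t digest() const { return state; }
    };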

Another question: is the http page header indexed or not? Because I need
to md5sum only the page itself, sans base URL and http header info, for
obvious reasons. Pages with any kind of dynamic content will hopelessly
break this test (an access counter whose output is embedded inline in a
page comes to mind as an example).
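
What I mean by 'sans http header info' is simply: checksum the entity body
only, i.e. everything after the blank line that terminates the header, so
that varying headers (Date:, cookies and so on) do not make identical
pages look different. A trivial sketch, assuming the raw response is
available as one string:

    #include <string>

    // Return only the entity body of a raw HTTP response, so that the
    // checksum ignores the header block entirely.
    std::string body_only(const std::string &raw_response)
    {
        std::string::size_type split = raw_response.find("\r\n\r\n");
        if (split == std::string::npos)
            return raw_response;        // no header found; use the lot
        return raw_response.substr(split + 4);
    }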

The second step will be to augment the code of the first step with a
'replace-href-in-db-if-lower-hop-count' rule; that part will apply to the
3.2 code (I intend to do the first step on 3.1.5).
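
Based on your description of db.docdb and db.docs.index, I picture that
rule roughly like this; a sketch only, where DocRecord and the two maps
merely stand in for the real 3.2 structures and all names are invented:

    #include <map>
    #include <string>

    struct DocRecord { std::string url; int hop_count; };

    // When a duplicate is detected, keep whichever URL has the lower
    // hop count: drop the old URL from the URL->DocID index, map the
    // preferred URL to the existing DocID, and update the record.
    void keep_preferred(std::map<int, DocRecord> &docdb,        // DocID -> record
                        std::map<std::string, int> &docs_index, // URL -> DocID
                        int doc_id, const std::string &new_url, int new_hops)
    {
        DocRecord &rec = docdb[doc_id];
        if (new_hops < rec.hop_count) {
            docs_index.erase(rec.url);
            docs_index[new_url] = doc_id;
            rec.url = new_url;
            rec.hop_count = new_hops;
        }
    }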

So, do you have any pointers on this?

Peter
