[htdig] url_part_aliases


Subject: [htdig] url_part_aliases
From: Jim Cole (greyleaf@yggdrasill.net)
Date: Sun Sep 03 2000 - 04:04:44 PDT


Hi - I am having a problem with url_part_aliases, and after a day of
fighting with it I have decided to give up and beg for help ;)

Here is the situation (using 3.1.5). I am building two sets of databases
that need to be merged. Database A covers a more or less typical crawl
with starting urls, limiting urls, and however many hops are necessary
to follow the appropriate links. Database B picks up a number of
stragglers that are not linked, using specific files as starting urls
with a max_hop_count of 0.

Database A has a bunch of url_part_aliases pairs of the form /~blah *1
/~bleh *2 etc. Database B does not use url_part_aliases and there is no
URL associated with that database that contains any of the above "from"
parts.

I run htdig on A and B with no apparent problems. I run htmerge on B with
no apparent problems. Then I run htmerge the last time on A and pull in
B with the -m option. Everything still looks good at this point.

Then I try htsearch using a different configuration file that has
url_part_aliases pairs of the form /some/newblah *1 /some/newbleh *2
etc. The url_part_aliases appear to be working just as I had hoped.
However, there is an unpleasant side effect. There are a large number
of documents from database B that will not show up on the result pages.
The documents *are* there and htsearch does find them, but they will
not display.
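
For reference, the two settings look roughly like this. The path parts
are the same placeholders used above, not my real ones, and the two
lines live in two separate config files:

```
# Config file used with htdig/htmerge for database A
url_part_aliases: /~blah *1 /~bleh *2

# Separate config file used only with htsearch
url_part_aliases: /some/newblah *1 /some/newbleh *2
```

The config for database B has no url_part_aliases line at all.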

If I pick very specific text from some of the documents that went into
database B, htsearch will report one or two hits, but otherwise show no
results. Likewise, many of the searches I have tried give me a number of
hits that is not the same as the number of summaries displayed. If I
comment the url_part_aliases line out of the search config file, the
missing documents reappear and the number of hits matches the number of
summaries. But then, of course, I end up with broken URLs everywhere
that url_part_aliases was applied during the dig.

Any idea on what is going on here? My only guess at this point is that
it has something to do with there being some substring overlap between
the real URLs in B and the URLs to which some of the documents in A
are being rewritten. But I have no idea why that would be a problem or
how to get around it.

Jim
