Re: [htdig] url_part_aliases


Subject: Re: [htdig] url_part_aliases
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Sep 06 2000 - 08:53:32 PDT


This seems remarkably similar to the problems reported by Stefan Reich,
only he didn't mention anything about merging two databases together.
Which platform are you running ht://Dig on, and did you use a pre-compiled
binary or build from source?

I took a quick look at the code that merges databases, and it doesn't
seem to do any URL encoding or decoding at all using url_part_aliases.
That could lead to some problems, but I don't think they would be
related to what you're running into. If only database A uses encodings,
and B is folded into A rather than the other way around, I don't think
that would be a problem as long as none of the unencoded URLs in B are
supposed to override encoded URLs in A.
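
To make the encoding and decoding concrete, here is a rough Python
sketch of the round trip as the url_part_aliases documentation describes
it: "from" parts are replaced by their "to" strings when a URL is
written to the database, and replaced back when it is read. This is a
simplified model, not the actual ht://Dig code. The host name, the
example URLs and the helper names (translate, encode, decode) are all
made up for illustration. Only the alias pairs follow the form given in
the report.

```python
# Simplified model of url_part_aliases. NOT the actual ht://Dig code.
# All URLs and helper names here are hypothetical.

def translate(url, pairs):
    """Replace each key of `pairs` found in `url` with its value,
    preferring the leftmost match and, at a given position, the longest."""
    keys = sorted(pairs, key=len, reverse=True)
    out = []
    i = 0
    while i < len(url):
        for key in keys:
            if url.startswith(key, i):
                out.append(pairs[key])
                i += len(key)
                break
        else:
            # No alias matches at this position; copy one character.
            out.append(url[i])
            i += 1
    return "".join(out)

def encode(url, aliases):
    """Dig time: from-string -> to-string, just before storing the URL."""
    return translate(url, aliases)

def decode(stored, aliases):
    """Search time: to-string -> from-string, just after reading the URL."""
    return translate(stored, {to: frm for frm, to in aliases.items()})

# Pairs of the form described for database A and for the search config.
DIG_ALIASES = {"/~blah": "*1", "/~bleh": "*2"}
SEARCH_ALIASES = {"/some/newblah": "*1", "/some/newbleh": "*2"}

# A URL dug into database A is stored in encoded form...
stored_a = encode("http://www.example.com/~blah/index.html", DIG_ALIASES)
# ...and decoded with the search-time pairs, which renames that part.
shown_a = decode(stored_a, SEARCH_ALIASES)

# A URL from database B was stored without any encoding. Since it
# contains no to-string, decoding leaves it unchanged in this model.
stored_b = "http://www.example.com/stragglers/page.html"
shown_b = decode(stored_b, SEARCH_ALIASES)

print(stored_a)  # http://www.example.com*1/index.html
print(shown_a)   # http://www.example.com/some/newblah/index.html
print(shown_b)   # http://www.example.com/stragglers/page.html
```

In this model the unencoded URLs from B pass through decoding untouched,
which is why I wouldn't expect the merge by itself to cause trouble.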

I'm afraid I don't understand the url_part_aliases handling well
enough to come up with a reasonable guess as to where the problem may
lie, but I'd like to rule out platform-specific problems first. This
attribute has been around, and in use, for quite some time, and your
trouble report and Stefan's are the first I've seen where problems
like this have come up, IIRC.

According to Jim Cole:
> Hi - I am having a problem with url_part_aliases, and after a day of
> fighting with it I have decided to give up and beg for help ;)
>
> Here is the situation (using 3.1.5). I am building two sets of databases
> that need to be merged. Database A covers a more or less typical crawl
> with starting urls, limiting urls, and however many hops are necessary
> to follow the appropriate links. Database B picks up a number of
> stragglers that are not linked, using specific files as starting urls
> with a max_hop_count of 0.
>
> Database A has a bunch of url_part_aliases pairs of the form /~blah *1
> /~bleh *2 etc. Database B does not use url_part_aliases and there is no
> URL associated with that database that contains any of the above "from"
> parts.
>
> I run htdig on A and B with no apparent problems. I run htmerge on B with
> no apparent problems. Then I run htmerge the last time on A and pull in
> B with the -m option. Everything still looks good at this point.
>
> Then I try htsearch using a different configuration file that has
> url_part_aliases pairs of the form /some/newblah *1 /some/newbleh *2
> etc. The url_part_aliases appear to be working just as I had hoped.
> However, there is an unpleasant side effect. There are a large number
> of documents from database B that will not show up on the result pages.
> The documents *are* there and htsearch does find them, but they will
> not display.
>
> If I pick very specific text from some of the documents that went into
> database B, htsearch will report one or two hits, but otherwise show no
> results. Likewise, many of the searches I have tried give me a number of
> hits that is not the same as the number of summaries displayed. If I
> comment the url_part_aliases line out of the search config file, the
> missing documents reappear and the number of hits matches the number of
> summaries. But then, of course, I end up with broken URLs everywhere
> that url_part_aliases was applied during the dig.
>
> Any idea on what is going on here? My only guess at this point is that
> it has something to do with there being some substring overlap between
> the real URLs in B and the URLs to which some of the documents in A
> are being rewritten. But I have no idea why that would be a problem or
> how to get around it.
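
The substring-overlap guess can at least be checked mechanically. The
Python sketch below lists which of database B's URLs contain one of the
search-time "from" strings. It only tests for the overlap and says
nothing about whether that is what hides the documents. The URLs and the
function name overlapping_urls are hypothetical, and you would need to
substitute the real list of URLs from B, however you obtain it.

```python
# Hypothetical check for overlap between B's real (unencoded) URLs and
# the search-time "from" strings. Example data only.

def overlapping_urls(urls, search_aliases):
    """Return (url, from_string) for every URL containing a from-string."""
    hits = []
    for url in urls:
        for frm in search_aliases:
            if frm in url:
                hits.append((url, frm))
    return hits

# Pairs of the form used in the search configuration.
SEARCH_ALIASES = {"/some/newblah": "*1", "/some/newbleh": "*2"}

# Made-up stand-ins for the starting URLs of database B.
B_URLS = [
    "http://www.example.com/some/newblah/orphan.html",
    "http://www.example.com/other/page.html",
]

for url, frm in overlapping_urls(B_URLS, SEARCH_ALIASES):
    print(f"{url} contains from-string {frm}")
```

If the documents that fail to display turn out to be exactly the ones
this reports, that would support the overlap theory. If not, it can be
ruled out.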

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



