Re: htdig: Pages get indexed, but no results: BUG?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 22 Dec 1998 16:45:45 -0600 (CST)


Last week I wrote:
> According to Andriu:
> > I actually did run htdig -sivvv and I did see that the pages which are linked from
> > products.htm were definitely indexed - there are at least 100 pages linked from
> > products.htm so it's certain that they have been indexed.
> >
> > But when I search with keywords from these pages, htsearch does not find any results - I
> > made several tests, so I'm sure.
> >
> > htsearch finds pages from that site which do not start from products.htm - no problem
> > there.
> >
> > That is why I'm assuming that the pages get indexed and then deleted again.
> >
> > That is also why I think that it does not help when I take a start URL starting directly
> > from products.htm.
>
> Well, I throw my hands up on this one. I was able to reproduce the
> problem here, with 3.1.0b3, but I'm at a loss to explain it. As far as I
> can tell htdig will index the file if it sees the lowercase URL first,
> and fail to index it if it sees the uppercase URL first. However,
> it wasn't showing up in the search. I'm baffled. Also, if you put a
> page that contains the lowercase URL as the first page in the start_url
> list, it doesn't quite work either. In this case, the page shows up in
> the search, but it shows up with the uppercase URL! Weird. However,
> if you put the products.htm page itself as the first URL in start_url,
> it does seem to work - at least with 3.1.0b3. But you have to explicitly
> give the limit_urls_to, or htdig seems to get confused.

OK, after looking at Retriever.cc a bit more, for another reason, I came
across something that I think explains some of this behaviour. Every time
htdig sees an href, it updates the docdb record for that URL, to update
the backlink count and add the new description text. It also sets the URL
in the database to the URL it got in the latest href. I'm not sure why
it does this, but it would explain the strange behaviour. So, to avoid
problems with the missing upper-case file name, you'd have to make sure
that htdig sees the lower-case file name first, so it actually digs the
real document rather than getting an error, plus you have to make sure
that the last href to that document that htdig sees has the lower-case
file name as well, or else the wrong file name ends up in the docdb!
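
To make the "last href wins" effect concrete, here is a toy model in
Python. It is only a sketch of the behaviour as I described it above, not
the actual code in Retriever.cc. The DocRecord class, the see_href()
function, the example.com URLs, and above all the idea that both spellings
map to one record by lower-casing the key are assumptions made purely for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """Toy stand-in for a docdb record (field names are hypothetical)."""
    url: str                                   # URL spelling currently stored
    backlinks: int = 0                         # number of hrefs seen so far
    descriptions: list = field(default_factory=list)


def record_key(url: str) -> str:
    # ASSUMPTION: both spellings resolve to the same record. Lower-casing
    # the key is just the simplest way to model that here.
    return url.lower()


def see_href(docdb: dict, url: str, description: str) -> DocRecord:
    """Model what happens each time an href to a document is encountered."""
    rec = docdb.setdefault(record_key(url), DocRecord(url=url))
    rec.backlinks += 1                         # update the backlink count
    rec.descriptions.append(description)       # add the new description text
    rec.url = url                              # overwrite with the latest spelling
    return rec


docdb = {}
# The lower-case href is seen first, so the real document would get dug...
see_href(docdb, "http://www.example.com/products.htm", "Products")
# ...but a later href with the upper-case spelling replaces the stored URL.
see_href(docdb, "http://www.example.com/PRODUCTS.HTM", "Our products")

final = docdb[record_key("http://www.example.com/products.htm")]
print(final.url, final.backlinks)
```

In this model the record ends up holding the upper-case spelling even
though the lower-case href was seen first, which matches the symptom of a
page showing up in the search under the wrong URL.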

That still doesn't explain why in some cases pages appeared to be dug,
but didn't show up in the searches. Maybe that'll come to me sometime
in the new year. :) I still maintain that this isn't a problem on a
properly configured server, with properly set up hrefs in your documents,
so I don't think I'll go out of my way to solve this one.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


