Re: htdig: Pages get indexed, but no results: BUG?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Fri, 18 Dec 1998 11:46:37 -0600 (CST)


According to Andriu:
> I actually did run htdig -sivvv and I did see that the pages which are linked from
> products.htm were definitely indexed - there are at least 100 pages linked from
> products.htm so it's certain that they have been indexed.
>
> But when I search with keywords from these pages, htsearch does not find any results - I
> made several tests, so I'm sure.
>
> htsearch finds pages from that site which do not start from products.htm - no problem
> there.
>
> That is why I'm assuming that the pages get indexed and then deleted again.
>
> That is also why I think that it does not help when I take a start URL starting directly
> from products.htm.

Well, I throw my hands up on this one. I was able to reproduce the
problem here, with 3.1.0b3, but I'm at a loss to explain it. As far as I
can tell htdig will index the file if it sees the lowercase URL first,
and fail to index it if it sees the uppercase URL first. However,
even when indexed, it wasn't showing up in the search. I'm baffled. Also, if you put a
page that contains the lowercase URL as the first page in the start_url
list, it doesn't quite work either. In this case, the page shows up in
the search, but it shows up with the uppercase URL! Weird. However,
if you put the products.htm page itself as the first URL in start_url,
it does seem to work - at least with 3.1.0b3. But you have to explicitly
set limit_urls_to, or htdig seems to get confused.

> Also I dig several sites, so would it make sense to use limit_urls_to?

It's hard to avoid using it, I think. By default it's set to ${start_url},
which works for simple cases where start_url lists the main page of one or
more web sites. However, when you get htdig to start at a deeply nested
page somewhere on one site, you need to explicitly set limit_urls_to to
include everything you want included, from all sites you dig. E.g. you
can set it to the list of main page URLs for every site you dig:

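# Only URLs matching one of these are indexed: the main page of every site.
limit_urls_to: http://www.mysite.com/ \
        http://www.alma.mater.edu/ \
        http://www.htdig.org/ \
        http://www.something.org/
# Start with the deeply nested page, then the main page of each site.
start_url: http://www.mysite.com/my/ownpage/products.htm \
        http://www.mysite.com/ \
        http://www.alma.mater.edu/ \
        http://www.htdig.org/ \
        http://www.something.org/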
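
To see why the default falls short for a deeply nested start page, here
is a rough Python sketch of the filtering as I understand it. It is not
htdig's actual code: it assumes a URL is accepted when it contains at
least one of the limit_urls_to strings, and the function names and the
widget1.htm page are made up for illustration.

```python
# Hypothetical model of the limit_urls_to check; NOT htdig's real code.
# Assumption: a URL passes if it contains at least one limit string.

def default_limits(start_urls):
    """Mimic the default setting, limit_urls_to: ${start_url}."""
    return list(start_urls)


def is_allowed(url, limit_urls_to):
    """Return True if the URL contains any of the limit strings."""
    return any(pattern in url for pattern in limit_urls_to)


# Digging starts at a deeply nested page only.
nested_start = ["http://www.mysite.com/my/ownpage/products.htm"]

# A made-up page linked from products.htm.
linked_page = "http://www.mysite.com/my/ownpage/widget1.htm"

# With the default, only URLs containing the full products.htm URL pass,
# so the linked page is rejected.
rejected_by_default = not is_allowed(linked_page, default_limits(nested_start))

# With the site's main page listed explicitly, the linked page passes.
accepted_when_explicit = is_allowed(linked_page, ["http://www.mysite.com/"])
```

Under that assumption, pages linked from products.htm only pass once
the site's main page URL is listed in limit_urls_to explicitly.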

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
