Re: [htdig] update vs. initial digging


Joseph Cheek (joseph@cheek.com)
Fri, 28 May 1999 21:59:40 -0700


hello,

an htdig -v shows that the directories queried returned "retrieved but not
changed", so the update dig never indexed the new html files. every url in the
output got that status, and the files created since the last dig were *not*
indexed at all:

New server: linuxnews.cheek.com, 80
0:0:255:http://linuxnews.cheek.com/: retrieved but not changed
1:2:1:http://linuxnews.cheek.com/a.biaies.mis/: retrieved but not changed
2:167:2:http://linuxnews.cheek.com/a.biaies.mis/269623.php: retrieved but not changed
3:168:2:http://linuxnews.cheek.com/a.biaies.mis/269636.php: retrieved but not changed
..
..
..
16872:16873:2:http://linuxnews.cheek.com/utah.linux/1408.php: retrieved but not changed
16873:16874:2:http://linuxnews.cheek.com/utah.linux/1409.php: retrieved but not changed
16874:16875:2:http://linuxnews.cheek.com/utah.linux/1410.php: retrieved but not changed
16875:16876:2:http://linuxnews.cheek.com/utah.linux/1411.php: retrieved but not changed
htdig: Run complete
htdig: 1 server seen:
htdig: linuxnews.cheek.com:80 16876 documents
htmerge: Total word count: 121969
htmerge: Total documents: 16876
htmerge: Total doc db size (in K): 116763

i am using apache's directory indexing to give the links to the html files
themselves, instead of creating an index file myself. i want to just point
htdig to http://linuxnews.cheek.com/ as the start_url and let it walk down the
tree itself.

so since all files are getting the "retrieved but not changed" message, does
that mean that apache is telling htdig that nothing has changed in the document
root of http://linuxnews.cheek.com/? if so, that is a blatant lie 8-). is
there any way to verify this, say by telnetting to port 80 and typing a
request in by hand?
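
for what it's worth, here is a rough python sketch of the by-hand check i have
in mind: it builds the same conditional GET one could type after
`telnet linuxnews.cheek.com 80` and reports the status code plus any
Last-Modified header that comes back. the function names are my own invention,
and i am only assuming that htdig's request looks roughly like this; it is not
taken from the htdig source.

```python
import socket
from email.utils import formatdate


def build_conditional_get(host, path, since_epoch):
    """Build the raw request text one would type by hand over telnet."""
    # HTTP dates are written in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT"
    stamp = formatdate(since_epoch, usegmt=True)
    return (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        f"If-Modified-Since: {stamp}\r\n"
        "\r\n"
    )


def parse_status_and_last_modified(raw_response):
    """Pull the numeric status and the Last-Modified value out of a raw reply."""
    head = raw_response.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    status = int(lines[0].split()[1])  # "HTTP/1.1 304 Not Modified" -> 304
    last_modified = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "last-modified":
            last_modified = value.strip()
    return status, last_modified


def fetch(host, path, since_epoch, port=80, timeout=10.0):
    """Send the conditional GET and return (status, last_modified)."""
    request = build_conditional_get(host, path, since_epoch)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HTTP/1.0: server closes the connection when done
                break
            chunks.append(data)
    reply = b"".join(chunks).decode("latin-1")
    return parse_status_and_last_modified(reply)
```

calling `fetch("linuxnews.cheek.com", "/", <time of the last dig>)` should then
show what the server really says: a 304 would mean the server claims nothing
changed, while a 200 would mean it sent the document, and the Last-Modified
value (if any) is what would get compared against the database.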

thanks!

joe

Geoff Hutchison wrote:

> At 2:59 PM -0400 5/26/99, Joseph Cheek wrote:
> >is this a bug, or by design? i would expect update digging to reindex the
> >pages, since they had been modified since the initial dig.
>
> Indeed it should.
>
> >as further consequence of this update-vs-initial dig problem, i have a web
> >site that continually adds new pages to the site. update digs never see the
>
> This is very odd. It sounds like there's some sort of miscommunication
> between your server and ht://Dig. Basically, it seems like htdig is
> assuming that these documents aren't changed when they have, in fact,
> changed on disk.
>
> So... My usual $0.02: try running htdig with some debugging turned on. In
> this case, I'd go for 'htdig -v' which should show you whether documents
> are reparsed, or what their status is. Basically, there are three possible
> responses for a document:
>
> 1) Document has been downloaded and parsed. +++**--- are all indications of
> part of the parsing:
> 0:2:0:http://www.htdig.org/: ++ size = 373
>
> 2) Document was in database, htdig sent If-Modified-Since header, and
> server sent an 'unchanged' response.
> 0:2:0:http://www.htdig.org/: Not changed
>
> 3) Document was in database, htdig sent If-Modified-Since header, and
> server sent document. Htdig did not index because Last-Modified header was
> the same as the date in the database.
> 0:2:0:http://www.htdig.org/: Retrieved but not changed
>
> So I'm wondering if you're seeing #2 and #3 when you should be seeing #1...
> What are the responses to your new pages?
>
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
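
to answer the "what are the responses" question across all 16876 lines, a
small python sketch like the one below could tally the `htdig -v` output into
the three cases geoff lists. the category names are mine, the line format is
guessed purely from the examples in this thread, and it assumes each entry sits
on one line (not wrapped the way a mailer wraps it). anything that is neither
"Not changed" nor "Retrieved but not changed" is lumped in as parsed, which may
well be too generous for other status texts.

```python
from collections import Counter


def classify(line):
    """Return (url, category) for one `htdig -v` entry, or None if no match."""
    parts = line.strip().split(":", 3)
    # Entries in this thread start with three numeric fields, e.g. "0:2:0:"
    if len(parts) < 4 or not all(p.isdigit() for p in parts[:3]):
        return None
    url, sep, status = parts[3].rpartition(": ")
    if not sep:
        return None
    status = status.strip().lower()
    if status == "retrieved but not changed":
        category = "retrieved-unchanged"  # case 3: document sent, same date
    elif status == "not changed":
        category = "not-changed"          # case 2: server said unchanged
    else:
        category = "parsed"               # case 1 (assumed): "++ size = ..."
    return url, category


def summarize(lines):
    """Count how many entries fall into each category."""
    counts = Counter()
    for line in lines:
        result = classify(line)
        if result is not None:
            counts[result[1]] += 1
    return counts
```

feeding it the saved output, e.g. `summarize(open("dig.log"))` with `dig.log`
being a hypothetical file name, would show at a glance whether any url ever
lands in the parsed bucket.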

--
      ___            ___
   __ | |_   __   __ | |_      __   __   _____  * Joseph Cheek, Director
  / _)|   \ / _) / _)|  _)    / _) /  \ |     | * joseph@cheek.com or
 ( (_ | | |(  _)(  _)|  \  _ ( (_ ( () )| |_| | * (877) CHEEK.COM
  \__)|_|_| \__) \__)|_\_)(_) \__) \__/ |_| |_| * http://www.cheek.com/
    Cheek Consulting, Seattle, provides Linux and Internet solutions
   linux * web commerce * html * java * perl * php * informix * mysql




This archive was generated by hypermail 2.0b3 on Fri May 28 1999 - 21:21:31 PDT