Re: [htdig] update digging


Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Wed, 3 Mar 1999 08:39:44 -0500 (EST)


>
> On Tue, 2 Mar 1999, Frank Guangxin Liu wrote:
> >
> > Does that mean it won't discover/dig new URLs either?
> >
>
> It will dig new URLs (unless you are limiting the # of pages/server and
> have already maxed this out).

Consider this scenario: on the initial dig, some of my web servers were
down, so the statistics from the original htdig run show 0 documents for
those servers. Now, on the update dig, those servers are up. I would expect
htdig to dig them fully, but unfortunately that is not the case.
The statistics from the update dig don't list those servers at all,
not even with 0 documents.

Frank

>
> I'm testing some mods to htDig to add an ability to ignore URLs in the
> database and start only on the start_url.
>
> This was easy on the surface, but tricky in practice because I wanted
> to skip unchanged pages, but still follow their links. Adding a list
> of HREFs for each document to the database allowed me to maintain a
> breadth-first search order during an update dig. This is nice for me
> because I want to frequently refresh an index of just the top 500 pages
> of a server without starting from scratch each time.
>
> I'd like to add this option to the build if anyone else would be
> interested.
>
> (P.S. You might also consider doing an initial dig on your subset and
> then merging the subset data into the full database when it's done)
>
> Matthew Edwards (medwards@go2net.com) | The fuel of innovation and
> Go2Net Inc. 999 Third Ave Suite 4700 | progress is freedom.
> Seattle WA 98104 |
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig@htdig.org containing the single word "unsubscribe" in
> the SUBJECT of the message.
>
>
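The update strategy quoted above (skip unchanged pages, but still follow
their stored links in breadth-first order, starting only from the start
URL) might be sketched roughly as follows. This is not ht://Dig code:
update_dig, fetch, is_modified, and the dict-based db are all hypothetical
names chosen for illustration, and Python is used only for brevity.

```python
from collections import deque


def update_dig(start_url, db, fetch, is_modified, max_pages=500):
    """Breadth-first update dig that starts only from start_url.

    All names here are assumptions made for this sketch:
      db          -- dict mapping URL -> {"hrefs": [...]} saved by the last dig
      fetch       -- callable(url) returning the hrefs found in a fresh copy
      is_modified -- callable(url) returning True if the page has changed
    Returns (visited, refetched), both in breadth-first order.
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    refetched = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        record = db.get(url)

        if record is not None and not is_modified(url):
            # Unchanged page: skip the download, reuse the stored link list.
            hrefs = record["hrefs"]
        else:
            # New or changed page: retrieve it and store its link list.
            hrefs = fetch(url)
            db[url] = {"hrefs": hrefs}
            refetched.append(url)

        visited.append(url)

        # Queue links in document order to preserve breadth-first traversal.
        for href in hrefs:
            if href not in seen:
                seen.add(href)
                queue.append(href)

    return visited, refetched
```

Because the link list of every document is kept in the database, an
unchanged page costs no download yet still contributes its links to the
queue, so the page limit (500 in the quoted message) is applied in the same
order as it would be on a dig from scratch.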




This archive was generated by hypermail 2.0b3 on Thu Mar 04 1999 - 09:09:19 PST