Re: [htdig] update digging


Matt Edwards (medwards@go2net.com)
Tue, 2 Mar 1999 12:28:04 -0800 (PST)


On Tue, 2 Mar 1999, Frank Guangxin Liu wrote:
>
> Does that mean it won't discover/dig new URLs either?
>

It will dig new URLs (unless you are limiting the number of pages per
server and have already reached that limit).

I'm testing some modifications to htDig that add the ability to ignore
the URLs already in the database and start only from the start_url.

This was easy on the surface, but tricky in practice because I wanted
to skip unchanged pages, but still follow their links. Adding a list
of HREFs for each document to the database allowed me to maintain a
breadth-first search order during an update dig. This is nice for me
because I want to frequently refresh an index of just the top 500 pages
of a server without starting from scratch each time.
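
A rough sketch of the idea, in Python rather than htDig's own C++, in case
it helps. This is not the actual patch: update_dig, fetch, db and max_pages
are all names made up for illustration. It assumes fetch(url, last_modified)
returns None for an unchanged page and a (modified, hrefs) pair otherwise,
and that db maps each URL to its stored modification stamp and HREF list.

```python
from collections import deque


def update_dig(start_url, fetch, db, max_pages=500):
    """Breadth-first update dig that starts only from start_url.

    Hypothetical interface, not htDig's:
      fetch(url, last_modified) -> None if the page is unchanged,
                                   else (modified, hrefs)
      db: dict mapping url -> {"modified": ..., "hrefs": [...]}
    Returns the URLs visited, in breadth-first order.
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        record = db.get(url)
        result = fetch(url, record["modified"] if record else None)

        if result is None:
            # Unchanged page: skip re-parsing, but reuse the HREFs
            # stored during an earlier dig so its links are still followed.
            hrefs = record["hrefs"] if record else []
        else:
            # New or changed page: store the fresh stamp and link list.
            modified, hrefs = result
            db[url] = {"modified": modified, "hrefs": list(hrefs)}

        visited.append(url)

        # Queue unseen links in document order to keep breadth-first order.
        for href in hrefs:
            if href not in seen:
                seen.add(href)
                queue.append(href)

    return visited
```

Because the stored HREFs stand in for the parsed page, an unchanged page
sits in the same place in the queue as it would in a fresh dig, so the
first max_pages URLs reached come out in the same order either way.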

I'd like to add this option to the build if anyone else would be
interested.

(P.S. You might also consider doing an initial dig on your subset and
  then merging the subset data into the full database when it's done.)

Matthew Edwards (medwards@go2net.com) | The fuel of innovation and
Go2Net Inc. 999 Third Ave Suite 4700 | progress is freedom.
Seattle WA 98104 |




This archive was generated by hypermail 2.0b3 on Thu Mar 04 1999 - 09:09:18 PST