Subject: Re: [htdig] htdig / Suse 6.2: very long run ?
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Tue Apr 25 2000 - 12:38:16 PDT


At 9:57 PM +0300 4/25/00, Peter L. Peres wrote:
> Then, I added my own HTML and PDF docs to the site, and things stopped
>being ok.
> Problem: htdig has been running for 24+ hours (i486/100MHz, 24MB RAM,
>lots of disk space). The data to be indexed is not larger than 80MB.
> I have run the htindex command several times so far (interrupted it in the
>middle, etc.). The last time(s) I generated a URL and image list.

OK, my first suggestion is to disable any cron-jobs or anything that
might write to the databases. Then I'd delete all your old ones and
reindex from scratch. I would guess that your databases are pretty
much dead right now. When you reindex, I'd probably use -v so you can
see the URLs it's indexing as it goes. This way you can see if it
gets into a loop.

But right now if you stop it in the middle of a run, there's no
guarantee your databases are worth much. You might want to use the -l
flag, which will trap interruptions and attempt to exit gracefully.
Of course this also means exits will take some time.
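For example, a from-scratch run might look something like this (the
database directory and config path are just placeholders -- adjust them
to your installation):

rm -f /var/lib/htdig/db.*
htdig -i -v -l -c /etc/htdig/htdig.conf

The -i flag tells htdig not to reuse the old databases and to start
fresh, -v prints each URL as it goes, and -l installs the signal
handlers mentioned above.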

> This list was looked at using a command like:
>
>grep <...db.urls http://myhost.here|sort -r|uniq -d|less

To quote from the docs: <http://www.htdig.org/attrs.html#create_url_list>

If set to true, a file with all the URLs that were seen will be
created, one URL per line. This list will not be in any order and
there will be lots of duplicates, so after htdig has completed, it
should be piped through sort -u to get a unique list.

Remember that this list will also have all sorts of invalid URLs that
ht://Dig saw. It's not a list of the URLs in the database.
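So to turn that file into a unique per-host list (the db.urls path
depends on your database_dir setting; the hostname is the one from your
example):

grep http://myhost.here /path/to/db.urls | sort -u > url-list.txt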

> Is there a way to do the initial dig using a list of URLs ? I am tempted
>to make a giant URL list page using the URL list produced by htdig, after
>running it through uniq, and then let htdig index that, with a depth of 1.

Yes, you can include a file into any attribute easily, e.g.:
start_url: `/path/to/file`

A few notes:
1) You'll want a hopcount of 0 -- if you index to a depth of 1, htdig
will also follow all the links on those pages (see the config sketch
after these notes).
2) The indexer keeps a very good list of URLs--it will never reindex
a page with the same URL. (Note the last part of that--you might have
"the same page" with different URLs.)
3) If you index this way, you will use more memory, since htdig has to
assemble the whole URL list at once. Normally it can add a few links at
a time to the list, so it has less overhead.
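As a rough sketch (the file name is just an example; start_url and
max_hop_count are the actual attribute names), the relevant part of the
config file would be:

start_url: `/path/to/url-list.txt`
max_hop_count: 0

The backquotes make htdig read the value from that file, and
max_hop_count: 0 keeps it from wandering off into the links on those
pages.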

> Have I grossly exceeded htdig's limits ? ;-)

Hardly. Plenty of people regularly index many times that amount.

> When is a built-in uniq URL feature scheduled ?

If what you're asking for is a dump of the URLs actually in the
database, you can get one in 3.2.0b2 using htstat. In previous
versions, you can also get it by parsing the db.docs file that htdig
creates when run with the -t switch.
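For example (config and database paths are placeholders, and I'm going
from memory on the htstat option, so check htstat's usage output):

htstat -u -c /etc/htdig/htdig.conf

or, on 3.1.x, add -t to the htdig run and then pull the URLs out of
db.docs, e.g.:

htdig -i -t -v -c /etc/htdig/htdig.conf
grep http://myhost.here /path/to/db.docs | less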

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
