Re: htdig: Htdig and wwwoffle


Andrew M. Bishop (amb@gedanken.demon.co.uk)
Tue, 1 Sep 1998 20:30:50 +0100


Hi,

I have now installed htdig (3.0.8b2) and tried to use it; this raises
more questions.

> > 2) htdig would need to use only the URLs provided to be searched, not
> > follow links.
>
> This is already there. You could simply provide the list as the start_urls
> and set max_hops to 0. Though I haven't tried this sort of thing, this
> should do what you want--ht://Dig will index the URLs you provide and
> won't follow any links.

I tried having the following in the config file:

: start_url: http://localhost:8080/
: max_hops: 0

And when I ran the rundig script I got the following:

: {gedanken:92} /usr/local/htdig/bin/rundig -v
:
: New server: localhost, 8080
: 0:0:0:http://localhost:8080/: not found
: htdig: Run complete
: htdig: 1 server seen:
: htdig: localhost:8080 1 document
:
: htdig: Errors to take note of:
: Not found: http://localhost:8080/ Ref:
: htmerge: Sorting...
: htmerge: Removing doc #0

The connections that were made to the proxy server on port 8080 were
for the URLs http://localhost/robots.txt and http://localhost/; the
port number was lost when the proxy was used.
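
(For reference, the proxy setup presumably relies on htdig's
http_proxy attribute; a minimal sketch, assuming that attribute is
supported in 3.0.8b2 in the documented form:

: http_proxy: http://localhost:8080/

together with the start_url and max_hops lines above. That would be
consistent with the log: the proxy-form request for
http://localhost:8080/robots.txt loses its port and goes out as
http://localhost/robots.txt.)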

This method will not be convenient for use with wwwoffle anyway,
since there may be 10,000 URLs; this would make quite a long config
file and is certainly not the most efficient way of doing it. It
would be better if htdig could read the list of URLs from stdin.
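
As a stopgap, the list could be folded into a generated config file.
A rough sketch, assuming the URLs sit one per line in a file called
urls.txt (the file name is illustrative; start_url does accept a
whitespace-separated list of URLs):

: ( echo 'max_hops: 0'
:   awk 'BEGIN { printf "start_url:" }
:        { printf " %s", $0 }
:        END { printf "\n" }' urls.txt ) > htdig-generated.conf

Even then, a single start_url line with 10,000 entries is clumsy at
best, so stdin would still be the cleaner interface.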

> > 3) htdig would need to not use the robots.txt because these will not
> > have been cached.
>
> Hm. Well ht://Dig checks for the existence of the file. Perhaps a request
> to wwwoffle for the robots.txt should just return 404? This may already be
> the current behavior. If ht://Dig doesn't find the file, it assumes there
> are no restrictions (as per the standard).

WWWOFFLE would return a 404, but then the next time that it is online
it would fetch the robots.txt that had been requested. There are
configuration file options for wwwoffle to stop this, but they would
block all robots.txt requests and not just those that htdig makes.
The ability to turn off the requesting of robots.txt would be a big
help.
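
(For illustration, the blanket approach would be a wildcard entry in
wwwoffle's DontGet configuration section, along these lines; the
exact wildcard syntax here is from memory and should be checked
against the wwwoffle.conf documentation:

: DontGet
: {
:  */robots.txt
: }

which is exactly the problem: it blocks robots.txt for every client,
not just htdig.)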

> > 4) wwwoffle will need to provide the CGI interface to htdig.
>
> In answer to your question and the question on writing a Java servlet,
> you don't have to use htsearch directly to interface to ht://Dig. For one,
> the databases and config files are all there for anyone to use. For
> another, htsearch will run from the command line, which can circumvent CGI
> problems somewhat.

Where are these command line options described? I could not find them
in the documentation.
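
My guess, extrapolating from the CGI interface, is something like the
following, but I could not confirm it against 3.0.8b2 (both the -c
option and the query-string argument are assumptions here):

: /usr/local/htdig/bin/htsearch -c /usr/local/htdig/conf/htdig.conf \
:     'words=wwwoffle'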

I was planning to use the search forms etc. that you provide; this
would ensure that htdig gets publicity when it is used as part of
wwwoffle, as well as saving me from having to understand exactly how
it works and what the command line parameters are.

I am having some further problems running htdig.

When the config file contains the following:

: start_url: http://www.gedanken.demon.co.uk/
: max_hops: 0

I get:

: {gedanken:94} /usr/local/htdig/bin/rundig -v
:
: New server: www.gedanken.demon.co.uk, 80
: htdig: Run complete
: htdig: 1 server seen:
: htdig: www.gedanken.demon.co.uk:80 0 documents
: htmerge: Unable to open word list file '/usr/local/htdig/db/db.wordlist'

It is not even contacting the proxy server to get the URL.

-- 
Andrew.
----------------------------------------------------------------------
Andrew M. Bishop                             amb@gedanken.demon.co.uk
                                      http://www.gedanken.demon.co.uk/
----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-request@sdsu.edu containing the single word "unsubscribe" in
the body of the message.


