Re: htdig: rundig still not finding all urls in htdig.conf


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 26 Nov 1998 13:32:41 -0600 (CST)


According to Debra Wilcox:
> (Apologies in advance if you have seen this message before. I sent it at 12:30
> pm and still do not see it almost 5 hours later so reposting.)

Dunno about the rest of the list, but I only saw it once.

> Thank you for your help so far, unfortunately, the rundig still isn't finding
> all of the urls. I tried putting the backslash and carriage returning as
> suggested, it still only found one url. So, I have been trying various
> spacing-backslashing, this is the latest and best rate of find.
>
> (htdig.conf)
> start_url:
> <http://www.core.manhattan.ks.us//%A0>http://www.core.manhattan.ks.us/
> <http://www.co.riley.ks.us//>http://www.co.riley.ks.us/\
> <http://www.manhattan.k12.ks.us//%A0>http://www.manhattan.k12.ks.us/
> <http://www.ci.manhattan.ks.us//>http://www.ci.manhattan.ks.us/\
> <http://www.lib.ksu.edu//%A0%A0>http://www.lib.ksu.edu/\  
> <http://www.manhattan.lib.ks.us//%A0>http://www.manhattan.lib.ks.us/
> <http://www.ksu.edu/>http://www.ksu.edu\
> <http://www.manhattan.org/>http://www.manhattan.org/

Is your mailer adding all the <http://www.....> stuff at the start of
each line, and the hex A0 byte (a non-breaking space in ISO-8859-1) on
some lines, or do the lines actually look that way in your htdig.conf?
If they do, that's wrong! It should read:

start_url: http://www.core.manhattan.ks.us/ \
        http://www.co.riley.ks.us/ \
        http://www.manhattan.k12.ks.us/ \
        http://www.ci.manhattan.ks.us/ \
        http://www.lib.ksu.edu/ \
        http://www.manhattan.lib.ks.us/ \
        http://www.ksu.edu/ \
        http://www.manhattan.org/

> Searches now find 3 sites, the
> <http://www.core.manhattan.ks.us/>http://www.core.manhattan.ks.us,
> <http://www.manhattan.lib.ks.us/>http://www.manhattan.lib.ks.us, and
> <http://www.ksu.edu/>http://www.ksu.edu
>
> Does anyone see something I am missing in how this could happen? My tech and I
> are wondering if pico is spacing things in such a way that the htdig software
> is not reading all that is there.

Do you mean pico is messing up your htdig.conf, or your *.html files?
Pico does line wrapping, which shouldn't be a problem with the *.html
files (if htdig has a problem with that, it would be a bug). However,
pico's line wrapping could be a problem with your htdig.conf file,
because any line that is continued on the next line must end with a
trailing backslash. As long as you pay attention to line wrapping (or
folding), or turn it off with "pico -w", using pico to edit htdig.conf
shouldn't pose a problem.
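
To make the continuation rule concrete, here is a rough Python sketch. It is
not htdig's actual parser: join_continuations is a made-up name, and the
sketch assumes a backslash only counts as a continuation when it is the very
last character on the line, so that a stray space after it (or a line wrapped
by the editor without one) cuts the following URLs loose.

```python
def join_continuations(text):
    """Join each line that ends in a backslash with the line after it.

    Assumption for this sketch: the backslash must be the very last
    character on the line to count as a continuation.
    """
    logical, pending = [], ""
    for raw in text.split("\n"):
        if raw.endswith("\\"):
            # Continuation: drop the backslash and keep collecting.
            pending += raw[:-1]
        else:
            # No trailing backslash: the logical line ends here.
            logical.append(pending + raw)
            pending = ""
    if pending:
        # The text ended on a continued line; keep what was collected.
        logical.append(pending)
    return logical


# Backslash is the last character: both URLs land on one logical line.
good = "start_url: http://www.ksu.edu/ \\\n        http://www.manhattan.org/"
# Two spaces after the backslash: the second URL becomes a separate line.
bad = "start_url: http://www.ksu.edu/ \\  \n        http://www.manhattan.org/"

print(len(join_continuations(good)))
print(len(join_continuations(bad)))
```

Under that assumption, the second URL in the "bad" case never becomes part
of start_url, which would match the symptom of only some sites being found.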

Take a close look at your entire htdig.conf file to see if there's anything
funny in it. (E.g. "cat -v -e -t htdig.conf | more" will show any hidden
control characters and such, and indicate the line endings with a "$".)
Also, if your htdig.conf file uses the limit_urls_to directive, pay close
attention to it too, for any strange characters or other oddities.
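
If cat's output is hard to read, a few lines of Python can do a similar
check. This is only a sketch: find_odd_bytes is a made-up helper name, and
it simply flags every byte outside printable ASCII (tabs excepted), which
catches the hex A0 byte and stray carriage returns.

```python
def find_odd_bytes(data):
    """Return (line_number, column, byte_value) for each suspicious byte."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            # Flag control characters (except tab) and anything above 0x7E.
            if (byte < 0x20 and byte != 0x09) or byte > 0x7E:
                hits.append((lineno, col, byte))
    return hits


# Made-up sample: an A0 byte before the backslash and a DOS line ending.
sample = (b"start_url: http://www.lib.ksu.edu/\xa0\\\r\n"
          b"        http://www.ksu.edu/\n")

for lineno, col, byte in find_odd_bytes(sample):
    print("line %d, column %d: 0x%02X" % (lineno, col, byte))
```

To check the real file, pass it the file's contents, e.g.
find_odd_bytes(open("htdig.conf", "rb").read()).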

If that doesn't turn up any problems, try running "htdig -vvv" to see where
things go wrong. Note that for 30K+ documents, this will generate A LOT
of output, so you may want to save it in a file and browse through it.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


