Re: [htdig] some directories not indexed, another dig barfs


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Aug 28 2000 - 10:02:12 PDT


According to Stephen L Arnold:
> 1) I'm trying to index a bunch of word docs, which worked fine last
> time, and html content in separate databases. The html dig goes
> fine, but the word doc dig barfs with the db error. It worked fine
> before I updated the word doc tree; I'm thinking it might be a
> permissions problem (I can't check that machine now, as it's at
> work). What do the permissions need to be in the content tree?

The permissions must be as for anything you want to publish on the
web: your files must be readable by the user ID under which your web
server runs. In addition, if you index files using local_urls, the
files must also be readable by the user ID under which you run htdig.
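
As a rough way to check this, the sketch below walks a content tree and
lists anything that is not readable by "other" users. It assumes the web
server and htdig run as accounts that are neither the owner nor in the
group of the files, so the "other" permission bits are what matter; if
that is not true on your system, adjust the check. The function name
not_world_readable is just an illustrative name, not part of ht://Dig.

```python
import os
import stat


def not_world_readable(root):
    """Return sorted paths under root lacking 'other' read permission.

    Directories also need the 'other' execute (search) bit, otherwise
    nothing below them can be reached. The root itself is not checked.
    """
    problems = []
    dir_bits = stat.S_IROTH | stat.S_IXOTH
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.stat(path).st_mode
            except OSError:
                # Broken symlink or vanished entry: report it as a problem.
                problems.append(path)
                continue
            if (mode & dir_bits) != dir_bits:
                problems.append(path)
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.stat(path).st_mode
            except OSError:
                problems.append(path)
                continue
            if not (mode & stat.S_IROTH):
                problems.append(path)
    return sorted(problems)
```

Calling not_world_readable() on the top of the Word doc tree and printing
the result should show at a glance whether permissions changed when the
tree was updated.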

I highly doubt a permission problem would lead to a db error, though.
More likely, I'd assume that catdoc is barfing up garbage on some of
the Word docs it's trying to parse, and some of those garbage "words"
are polluting your databases. I'd try to narrow down which of the
new documents in the updated tree are the ones it's having trouble
with, and either remove or clean them up - perhaps saving them as an
earlier Word document version. Another option might be to find a
filter other than catdoc to convert Word files, or to upgrade to a more
recent version of catdoc if one is available.

> 2) On another machine, a standard html dig only sees 2 directories,
> but not the others (it digs those two directories fine, then thinks
> it's done). All directories have the same owner/group and rx
> permissions, and are at the top-level apache document root. On two
> other machines at my house, it works fine (ie, it sees all the
> directories, including the symlink from /home/httpd/html ->
> /usr/doc/HTML, where the LDP docs are). Any ideas?

First of all, it's important to realise that htdig doesn't read
directories itself. It only follows HTML links from one document to
other documents. It can't tell the difference between a directory
listing automatically generated by Apache (or other web server) and a
regular HTML file. In fact, Apache is just spitting out a standard HTML
document for the directory listing.
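
To illustrate the point, here is a minimal sketch of what "following
HTML links" amounts to. It is not htdig's actual parser, only Python's
standard html.parser pulling href values out of a page; the names
LinkCollector and extract_links are made up for this example. An
Apache-style directory listing and a hand-written page look the same to
it: both are just anchors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A directory that no page links to, and for which the server generates no
listing, never shows up in that list, so an indexer working this way has
no means of discovering it.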

When htdig isn't indexing files you think it ought to, you need to
determine whether it's rejecting certain URLs, and if so, why, or whether
it's even seeing links to the documents you want to index.

http://www.htdig.org/FAQ.html#q4.1
http://www.htdig.org/FAQ.html#q5.1
http://www.htdig.org/FAQ.html#q5.18

You may also want to check whether Apache is configured to follow
symbolic links (the FollowSymLinks option). Do you see all the files you
need to index when you access the site from a web browser, starting from
your start_url and following links?
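
If clicking through by hand is tedious, the same check can be roughed
out in a script. The sketch below is an assumption-laden stand-in for
that manual test, not a reimplementation of htdig: it starts at one URL,
follows <a href> links on the same host only, and returns every URL it
could reach. It ignores robots.txt, exclude patterns and everything else
htdig's configuration controls. The names reachable_urls and fetch_html
are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _Hrefs(HTMLParser):
    """Collect raw href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fetch_html(url):
    """Default fetcher: page text, or "" for non-HTML or failed requests."""
    try:
        with urlopen(url, timeout=10) as resp:
            if "html" not in resp.headers.get("Content-Type", ""):
                return ""
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


def reachable_urls(start_url, fetch=fetch_html, limit=500):
    """Breadth-first walk over same-host links, starting at start_url.

    Stops taking pages off the queue once about `limit` URLs are known.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        parser = _Hrefs()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            # Drop #fragments so one page is not counted several times.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen)
```

Run it with your start_url on the machine that misbehaves and on one
that works; if the missing directories are absent from the first list,
htdig never had a link to them in the first place.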

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Tue Aug 29 2000 - 00:13:44 PDT