Subject: [htdig] directory index returned in search (and other questions)
From: Stephen L Arnold
Date: Tue Aug 01 2000 - 17:04:37 PDT


I now have a simple custom search form defined, with one selection box for topic area (using the exclude parameter to htsearch) and one for document (report) type (using the restrict parameter). It now works the way I want it to, except for a couple of issues. I'm using this custom search on a tree of M$ Word documents. I also use HtDig on a large HTML tree, but I don't have any of the following problems with that one.
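
To make the setup concrete, here is a rough Python sketch of the kind of query the form ends up sending to htsearch. The parameter names words, exclude and restrict are the htsearch inputs described above; the /cgi-bin/htsearch path, the function name and the example values are assumptions made for illustration only, not details of the actual installation.

```python
from urllib.parse import urlencode

def build_htsearch_query(topic_exclude, report_restrict,
                         words="doc", base="/cgi-bin/htsearch"):
    """Build the query URL a form like the one described might submit.

    `base` is a hypothetical CGI location; `words` defaults to the hidden
    search term "doc" mentioned in the post.
    """
    params = {
        "words": words,               # hidden search term
        "exclude": topic_exclude,     # URL pattern to leave out (topic box)
        "restrict": report_restrict,  # URL pattern to limit to (report type box)
    }
    return base + "?" + urlencode(params)

# Example with made-up select-box values
print(build_htsearch_query("/topicA/", "AR"))
```

Since every request carries the same hidden word, any indexed page that contains "doc" can match, whatever the select boxes are set to.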

1) I often get one or more directory indices (i.e., a URL that points to a directory) returned in the search results. Sometimes they come out with a high score (at the top) and sometimes with a low score (at the bottom). I believe this is a result of using "doc" as a hidden search term, because htsearch sees the directory index as just another document (one containing a list of *.doc files). The hidden term was my boss's idea, so he could get search results without entering any search terms (just by choosing from the select boxes on the form). I said this was probably not a good idea... Any work-around tips or other suggestions? I want Apache to generate directory indices in general, but I don't know of a way to turn them off for a given set of sub-directories. Can anything in HtDig help me?

2) Using catdoc to convert the doc files to text, I sometimes get binary garbage in the long-form results. Sometimes it's just a few characters, sometimes it's a *very* long string of garbage. Here is an example of the former:

Word Document AR502-05.DOC
     PROJECT/TASK : TPS/502 REPORT NO. : AR502-05

I'd *really* like to get rid of these annoying garbage characters; I'm about to try a newer version of wv (wordview, whatever) to see if it helps. The funny thing is, it only happens on some Word docs. Most are converted fine (i.e., without the garbage). Anybody have any tips for this one?
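
One stopgap that might be worth trying, sketched here purely as an assumption and not as a known fix, is to post-filter the converter's text output and drop anything that is not printable before it reaches the indexer. The function name below is hypothetical, and the sketch does not address why the converter emits the garbage in the first place.

```python
def strip_binary_garbage(text):
    """Keep printable characters plus newlines and tabs; drop the rest.

    Hypothetical helper: it only removes control and other non-printable
    characters, so garbage made of ordinary printable symbols would remain.
    """
    keep_whitespace = {"\n", "\t"}
    return "".join(
        ch for ch in text
        if ch.isprintable() or ch in keep_whitespace
    )

# Example: control bytes embedded in otherwise readable converted text
sample = "REPORT\x00\x01 NO. : AR502-05\x7f\n"
print(strip_binary_garbage(sample), end="")
```

A filter like this could sit between the converter and the indexer, for example inside whatever wrapper script calls catdoc, but where exactly it would go depends on how the external parser is set up.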

Thanks in advance for any suggestions, Steve

