Re: [htdig] directory index returned in search (and other questions)


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Aug 02 2000 - 08:59:27 PDT


According to Stephen L Arnold:
> 1) I often get one or more directory indices (i.e., a URL that points
> to a directory) returned in the search results. Sometimes they come
> out with a high score (at the top) and sometimes a low score (at
> the bottom). I believe this is a result of using "doc" as a hidden
> search term, because the directory index is seen by htsearch as just
> another document (with a list of *.doc files). This was my boss's idea
> so he could get search results without entering any search terms (just
> select from the select boxes on the form). I said this was probably
> not a good idea... Any work-around tips or other suggestions? I want
> Apache to index directories (in general) but I don't know of a way to
> turn it off in a given set of sub-directories. Can anything in HtDig
> help me?

This is a very common problem, but I'm afraid there's no easy answer.
htdig sees the directory listings that the web server generates as
ordinary HTML pages, so it indexes them like any other page. If you
could coax your web server into inserting a tag like the following
into the head section of the pages it generates for indexes, that would
solve the problem.

        <meta name="robots" content="noindex,follow">

Unfortunately, I don't believe there's any way to do so.

Another option would be to generate index.html files yourself for any
directory that poses this problem. This would give you direct control
over what text htdig can or can't index from these files, but it would
mean you'd need to update these index files whenever you change the
contents of these directories.
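
One way to take the sting out of keeping hand-made index files up to date
is to regenerate them with a small script. What follows is only a rough
sketch, not part of htdig: the function names build_index_html and
write_index are made up for illustration, and it assumes a flat directory
of files (subdirectories are skipped). Each generated page carries the
robots meta tag shown above, so htdig should follow the links without
indexing the listing itself.

```python
import html
from pathlib import Path
from urllib.parse import quote

# The tag from the message above: follow the links, but do not index the page.
ROBOTS_META = '<meta name="robots" content="noindex,follow">'


def build_index_html(title, filenames):
    """Return the HTML for a directory index page (hypothetical helper)."""
    items = "\n".join(
        '<li><a href="{0}">{1}</a></li>'.format(quote(name), html.escape(name))
        for name in sorted(filenames)
    )
    return (
        "<html>\n<head>\n"
        + ROBOTS_META + "\n"
        + "<title>Index of {0}</title>\n".format(html.escape(title))
        + "</head>\n<body>\n<ul>\n"
        + items
        + "\n</ul>\n</body>\n</html>\n"
    )


def write_index(directory):
    """Rewrite index.html in `directory`, listing every other regular file."""
    directory = Path(directory)
    names = [
        p.name
        for p in directory.iterdir()
        if p.is_file() and p.name != "index.html"
    ]
    page = build_index_html(directory.name, names)
    (directory / "index.html").write_text(page, encoding="utf-8")
```

Running write_index on each problem directory from a cron job, or after
each upload, would keep the listings current without editing them by hand.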

> 2) Using catdoc to convert the doc files to text, I sometimes get
> binary garbage in the long-form results. Sometimes it's just a few
> characters, sometimes it's a *very* long string of garbage. Here is
> an example of the former:
>
> Word Document AR502-05.DOC
> PROJECT/TASK : TPS/502 REPORT NO. : AR502-05
> ^^^^^^^^^^
>
> I'd *really* like to get rid of these annoying garbage characters;
> I'm about to try a newer version of wv (wordview, whatever) to see
> if it helps. The funny thing is, it only happens on some word docs.
> Most are converted fine (i.e., without the garbage). Anybody have
> any tips for this one?

I believe catdoc has problems with newer Word document formats, while
wv has problems with older formats. If you can figure out what sort of
"magic number" or signature in the header of these documents distinguishes
one format from the other, you could probably patch the doc2html.pl
script to use one or the other as required. Bear in mind that catdoc
generates plain text, while I think wv produces HTML, so you'd have to
configure doc2html appropriately for each filter type.
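
The header check could look something like the sketch below. It is written
in Python purely to illustrate the idea and is not a patch for doc2html.pl.
Several things here are assumptions rather than anything confirmed above:
the two signatures (the OLE2 compound-file header, and the two leading
bytes of a Word 2.x file), the names classify_header and pick_converter,
and above all the table mapping each signature to a converter, which you
would need to adjust after testing against your own documents. Note also
that several Word versions share the same OLE2 container, so the leading
bytes alone may not be enough to tell them apart.

```python
# Assumed signatures; verify them against your own files before relying on them.
OLE2_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 compound file header
WORD2_SIGNATURE = bytes.fromhex("dba5")             # assumed start of a Word 2.x file

# Hypothetical mapping from detected format to converter name.
# Which tool copes best with which format is an assumption to be tested.
CONVERTER_FOR = {
    "ole2": "wv",
    "word2": "catdoc",
    "unknown": "catdoc",
}


def classify_header(header):
    """Classify the leading bytes of a file as 'ole2', 'word2' or 'unknown'."""
    if header.startswith(OLE2_SIGNATURE):
        return "ole2"
    if header.startswith(WORD2_SIGNATURE):
        return "word2"
    return "unknown"


def pick_converter(path):
    """Read the first 8 bytes of `path` and return the converter name to use."""
    with open(path, "rb") as f:
        header = f.read(8)
    return CONVERTER_FOR[classify_header(header)]
```

Since the two converters produce different output (plain text versus HTML),
whatever calls pick_converter would also need to tell the indexer which
content type to expect for each result.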

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
