Re: [htdig] avoiding to re-index present files


Subject: Re: [htdig] avoiding to re-index present files
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Jan 18 2000 - 07:58:39 PST


According to GOMEZ Henri:
> I've got a project that indexes a huge number of PDF files.
>
> PDF files are created each day by an internal process,
> so I get around 1000 new PDF files each day (and I keep the old ones).
>
> I dynamically regenerate an index file (index.html) each day in the many
> subdirs where the PDF files are stored.
>
> How can I tell htdig to index only the newly arrived files?

As long as you don't use htdig's -i option (i.e. don't just use an
unmodified rundig script for updating), then htdig will only index
new or modified documents. If you index via the local filesystem,
using local_urls, this will be very quick. If you index via an HTTP
server, this will still work very well as long as the server honours the
If-Modified-Since header (i.e. returns a 304 status for older documents)
and returns a Last-Modified header. If the HTTP server does not honour
the If-Modified-Since header, but does return a valid Last-Modified
header, it will still work, but the PDF files will be needlessly re-read
(but not re-indexed) each time.
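
If you are not sure which of the HTTP cases above applies to your server,
a small probe like the following may help. It is only a sketch, not part
of htdig: the function names classify and probe are made up for
illustration, and the URL in the usage comment is a placeholder. It asks
for a document once to read its Last-Modified header, then asks again
with If-Modified-Since set to that same value and reports what came back.

```python
"""Probe how an HTTP server answers a conditional GET (illustrative sketch)."""
import urllib.error
import urllib.request


def classify(status, last_modified):
    """Map a conditional-request result onto the cases described above.

    status        -- HTTP status returned for the If-Modified-Since request
    last_modified -- Last-Modified value from the first request, or None
    """
    if last_modified is None:
        # The reply above does not cover this case, so only report it.
        return "no Last-Modified header: outside the cases described above"
    if status == 304:
        return "honours If-Modified-Since: unchanged files are skipped"
    if status == 200:
        return "ignores If-Modified-Since: unchanged files are re-read"
    return "unexpected status %d: check the URL and server" % status


def probe(url):
    """Request url, then request it again with If-Modified-Since."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        last_modified = resp.headers.get("Last-Modified")
    if last_modified is None:
        return classify(None, None)

    conditional = urllib.request.Request(
        url, headers={"If-Modified-Since": last_modified}
    )
    try:
        with urllib.request.urlopen(conditional) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # urllib reports non-2xx answers, including 304, as HTTPError.
        status = err.code
    return classify(status, last_modified)


# Usage (placeholder URL, substitute one of your own PDF files):
#   print(probe("http://www.example.com/docs/report.pdf"))
```

A "honours" result corresponds to the quick case; an "ignores" result
corresponds to the case where the files are fetched again on every run.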

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



