Subject: Re: [htdig] avoiding to re-index present files
From: Gilles Detillieux (email@example.com)
Date: Tue Jan 18 2000 - 07:58:39 PST
According to GOMEZ Henri:
> I've got a project which indexes a huge number of PDF files.
> PDF files are created each day by some internal process,
> so I get around 1000 new PDF files each day (and I keep the oldest).
> I dynamically regenerate an index file (index.html) each day in the many
> directories where the PDF files are stored.
> How could I tell htdig to only index the newly arrived files?
As long as you don't use htdig's -i option (i.e. don't just use an
unmodified rundig script for updating), then htdig will only index
new or modified documents. If you index via the local filesystem,
using local_urls, this will be very quick. If you index via an HTTP
server, this will still work very well as long as the server honours the
If-Modified-Since header (i.e. returns a 304 status for older documents)
and returns a Last-Modified header. If the HTTP server does not honour
the If-Modified-Since header, but does return a valid Last-Modified
header, it will still work, but the PDF files will be needlessly re-read
(but not re-indexed) each time.
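
To illustrate what "honouring If-Modified-Since" means on the server side, here is a minimal sketch in Python. It is not htdig's or any particular server's code; the function name response_status and the sample dates are made up for illustration. It only shows the decision a well-behaved server makes: answer 304 when the document has not changed since the date the client sent, otherwise answer 200 with the full document.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def response_status(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the document is unchanged since the client's date, else 200."""
    if if_modified_since is None:
        return 200  # unconditional request: send the full document
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)  # assume UTC if no zone given
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # not modified: the client can keep what it already has
    return 200


# Hypothetical example: a PDF last changed on Jan 17, client asks about Jan 18.
pdf_mtime = datetime(2000, 1, 17, 12, 0, 0, tzinfo=timezone.utc)
last_dig = format_datetime(
    datetime(2000, 1, 18, 6, 0, 0, tzinfo=timezone.utc), usegmt=True
)
print(response_status(pdf_mtime, last_dig))  # 304
```

A server that ignores the header behaves like the `if_modified_since is None` branch every time, which is why the PDF files get re-read in that case.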
-- 
Gilles R. Detillieux              E-mail: <firstname.lastname@example.org>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930