Re: [htdig] new to htdig

Subject: Re: [htdig] new to htdig
From: Robert Marchand (robert.marchand@UMontreal.CA)
Date: Wed Feb 23 2000 - 07:16:24 PST


At 08:38 00-02-23 -0600, Geoff Hutchison wrote:
>At 4:36 PM -0500 2/22/00, Robert Marchand wrote:
>>1) We badly need the 'fuzzy accent' algorithm or whatever the solution
>>would be to be able to search a word with and without accents: like
>>"Montréal" and "Montreal" and get the same results. This is very
>important for us. I've looked at some discussion on this topic here and
>>would like to know if it is soon to be released. If not, then we will
>have to find a quick-and-dirty solution like patching some files by
>It is not likely to be released soon. However, it won't require
>patching files--it will require a new class in htfuzzy/ along the
>lines of the Substring class (or the Speling class in 3.2.0b1). If
>you'd like some suggestions about how to do that, let me know.

Well, yes I would be interested in that.
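In the meantime, here is a rough sketch of the folding step such a fuzzy
class could use, assuming ISO-8859-1 (Latin-1) input. The function name
strip_accents_latin1 is my own invention, not htdig code; a real
implementation would plug into htfuzzy/ the way Substring and Speling do.
Folding both the indexed words and the query to the same unaccented key is
what would make "Montréal" and "Montreal" match:

```cpp
#include <string>

// Map Latin-1 accented letters to their unaccented base letter.
// Lower-case letters only, for brevity; upper-case would be handled
// the same way.  Unknown bytes pass through unchanged.
std::string strip_accents_latin1(const std::string &word)
{
    std::string folded;
    folded.reserve(word.size());
    for (std::string::size_type i = 0; i < word.size(); ++i)
    {
        switch (static_cast<unsigned char>(word[i]))
        {
        case 0xE0: case 0xE1: case 0xE2: case 0xE3: case 0xE4:
            folded += 'a'; break;                       // à á â ã ä
        case 0xE7:
            folded += 'c'; break;                       // ç
        case 0xE8: case 0xE9: case 0xEA: case 0xEB:
            folded += 'e'; break;                       // è é ê ë
        case 0xEC: case 0xED: case 0xEE: case 0xEF:
            folded += 'i'; break;                       // ì í î ï
        case 0xF2: case 0xF3: case 0xF4: case 0xF6:
            folded += 'o'; break;                       // ò ó ô ö
        case 0xF9: case 0xFA: case 0xFB: case 0xFC:
            folded += 'u'; break;                       // ù ú û ü
        default:
            folded += word[i]; break;
        }
    }
    return folded;
}
```

With this, "Montréal" (with é = 0xE9 in Latin-1) and "Montreal" fold to the
same key, so either spelling in a query would find either spelling in a
document.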

>>2) We have a problem with robots.txt and the database. It seems that if
>>the file robots.txt is modified or added after a complete reindex from
>>scratch and BEFORE an update reindex, some files that are now no more
>accepted are kept in the database. Does it mean that a complete
>reindex has to be done after a change in a robots.txt? That seems a bit
>>harsh. We have no control over all the sites to index.
>Yes, but think of it like this. You tell me that you want me to make
>a map of your house. You give me a certain set of keys (i.e. I can
>only get into certain rooms). I go off and do this and then you want
>me to give back some keys. I still have the map that I made though!
>The analogy I'm trying to make is that for the change in robots.txt
>to affect the database, it would have to "forget" parts of what it

Yes, in that analogy you would still have the map but not the keys.
With htsearch, though, you still have everything: you search and you find.
I can understand the point of view, but why is this different from a "404"
response? In that case the URL would be removed (depending on the
configuration file).
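For what it's worth, re-checking an already-indexed URL against freshly
fetched rules is mechanically simple, since robots.txt Disallow lines are
plain prefix matches. The sketch below is my own illustration, not htdig
code, and the names disallowed and disallow_prefixes are hypothetical; as
you say, the hard part is not this test but deciding what to drop from the
database afterwards:

```cpp
#include <string>
#include <vector>

// Return true if the URL path starts with any of the Disallow prefixes
// taken from a site's robots.txt.
bool disallowed(const std::string &path,
                const std::vector<std::string> &disallow_prefixes)
{
    for (std::vector<std::string>::size_type i = 0;
         i < disallow_prefixes.size(); ++i)
    {
        const std::string &prefix = disallow_prefixes[i];
        // compare(0, n, prefix) == 0 means path begins with prefix.
        if (path.compare(0, prefix.size(), prefix) == 0)
            return true;
    }
    return false;
}
```

An update pass could run every stored URL through a check like this and
mark the disallowed ones for removal, much as a "404" marks a URL today.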

>In short, it might sound harsh, but it wouldn't be easy to realize
>that because of a change in robots.txt (remember, we don't store
>them), we need to remove certain URLs. What if there's a page that's
>disallowed that linked to a section that would still be allowed but
>is now unreachable from other URLs?

Yes, the whole problem of removing URLs is bigger than I first thought.
But we have maybe over 100 sites to index, and we can't know when someone
will change his robots.txt file. For me it means we'll have to do a whole
reindex from scratch, maybe every week. I would have preferred otherwise.
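If it comes to that, the weekly full reindex can at least be automated.
A hypothetical crontab entry, assuming the standard rundig wrapper script
is installed at /usr/local/bin/rundig (adjust the path for the local
install):

```
# Full reindex from scratch every Sunday at 02:00.
# The rundig path is an assumption; check where your install put it.
0 2 * * 0 /usr/local/bin/rundig
```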



Robert Marchand          tel: 343-6111 ext. 5210
DiTER-SDI                e-mail:
Université de Montréal   Montréal, Canada

To unsubscribe from the htdig mailing list, send a message to
You will receive a message to confirm this.

This archive was generated by hypermail 2b28 : Wed Feb 23 2000 - 07:20:11 PST