RE: [htdig] 2 questions: &nbsp; and bad_words


Subject: RE: [htdig] 2 questions: &nbsp; and bad_words
From: NEPOTE Charles (Neuilly Gestion) (charles.nepote@cetelem.fr)
Date: Tue May 16 2000 - 08:32:28 PDT


According to Gilles Detillieux:

> According to "NEPOTE Charles (Neuilly Gestion)":
> > According to Gilles Detillieux:
> >
> > > According to "NEPOTE Charles (Neuilly Gestion)":
> > > > I have the same problem using a French locale (fr_FR),
> > > > on a Linux Mandrake 7.0 box.
> > > > As a newbie I won't hack the code... I am interested in Gilles's
> > > > solution. Is it possible to simply remap ASCII char 160 to ASCII
> > > > char 20? Which files should I modify, and how?
> > > >
> > > > Would it be a problem to change the next ht://Dig version so
> > > > that the parser converts &nbsp; to a space?
> > > > Would it be long and/or difficult?

> > > My solution was to set the locale, but apparently that didn't do the
> > > trick on your system. I'm really not sure why. Geoff's solution
> > > was to patch the source. It's a trivial fix: just change the 160 on
> > > htdig/SGMLEntities.cc line 34 to a 32 (20 is the hexadecimal value of
> > > a space, not decimal), and recompile, reinstall htdig, and reindex.
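
To make the decimal/hexadecimal point concrete, here is a minimal Python
sketch of the remapping being discussed. It is an illustration only, not
ht://Dig code: the helper name remap_nbsp is made up for this example, and
the real change is the one-number edit in htdig/SGMLEntities.cc described
above.

```python
# Illustration only: this is not ht://Dig code.
# remap_nbsp is a hypothetical helper name chosen for this sketch.

NBSP = 160   # ISO-8859-1 code of the non-breaking space (hexadecimal A0)
SPACE = 32   # ASCII code of an ordinary space (hexadecimal 20)


def remap_nbsp(text: str) -> str:
    """Replace every non-breaking space (code 160) with a plain space (code 32)."""
    return text.translate({NBSP: SPACE})


# "20" is the hexadecimal spelling of decimal 32, hence 32 in the patch.
assert 0x20 == SPACE
assert 0xA0 == NBSP
assert remap_nbsp("prix\u00a0: 100\u00a0F") == "prix : 100 F"
```

The same substitution applied at parse time is what would let words on
either side of a non-breaking space be indexed as separate words.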

> > (I'll tell you a secret: I installed from an RPM file ;-)
 
> That may be your problem right there! If you installed htdig from
> htdig-3.1.5-0.i386.rpm, it was built on a Red Hat 4.2 system with libc5,
> which doesn't properly support locales. Please provide more details
> about your system (distribution name and version, CPU type) and which
> RPM you installed. Your other message seemed to indicate that locale
> support was working, so I'm puzzled by the apparent discrepancy.

[copy/paste from my other message]
My config:
Pentium Pro 200
Linux Mandrake 7.0; automatic install in French.
(As I am a Linux newbie, I don't know which details would help you. One thing
I am quite sure of is that I didn't make many changes to the original config.
In particular, I didn't make any "locale" changes (I don't know how to do
that!...)).

ht://Dig 3.1.5 installed from an RPM made specially for Mandrake 7.0 by
MandrakeSoft, downloaded from:
ftp://ftp.ciril.fr/pub/linux/mandrake-devel/contrib/RPMS/htdig-3.1.5-2mdk.i586.rpm
(note that ftp.ciril.fr is an official MandrakeSoft mirror).
I did a normal install of the RPM without changing anything but the
htdig.conf file:
 -- I added locale: fr_FR
 -- I modified other attributes that have nothing to do with the locale problem.

 
> > OK. Maybe I will try: it will be my first time changing source code
> > (I'm a bit afraid)...
>
> You may find it easier to install the src.rpm, and use the rpm command
> to build the source. That way, it's easier to replace one package
> with another. Of course, you will have to develop a patch file for the
> change to htdig/SGMLEntities.cc, and add it to the spec file.

I am going to try on another machine, probably tomorrow.

 
> > > The change is a bit different in version 3.2, as the SGML decoding has
> > > changed, but it should be simple there too. I don't think we want to
> > > make this a permanent change in the distributed source, though, because
> > > it may have some undesirable consequences for some users.
> > > Of course, it's open for discussion.
> >
> > Let's open the discussion.
> > Questions:
> > -- What sort of undesirable consequences could there be?
>
> I don't know, but offhand the only thing I can think of is some users
> might prefer non-breaking spaces to remain non-breaking in the excerpts
> displayed in search results. It's probably not that big a deal, but we
> have been burned before when a seemingly innocuous change causes a lot
> of people to complain.

I understand.
So I would like to promote the new attribute solution.

 
> > -- Is there a case where &nbsp; has a lexicographic sense?
> > -- Is it possible to make remapping &nbsp; optional (for example with a
> > new attribute in htdig.conf (yes, I know, another one...))?
>
> It's certainly possible. The real question is whether this is desirable.
> The package is already suffering somewhat from feature bloat - the whole
> range of configuration attributes is very confusing to new users - so the
> decision to add another option must take that "cost" into consideration.
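
For the sake of discussion, an opt-in attribute could behave roughly as in
the Python sketch below. Everything in it is an assumption made for
illustration: the attribute name translate_nbsp does not exist in ht://Dig,
the parser is a simplified stand-in for the htdig.conf reader (it ignores
continuation lines and includes), and the default is chosen so that existing
installations would see no change.

```python
import re


def parse_config(text: str) -> dict:
    """Parse simple 'name: value' lines; skip blank lines and '#' comments.

    Simplified stand-in for the htdig.conf reader (no continuation lines).
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(":")
        config[name.strip()] = value.strip()
    return config


def decode_nbsp(text: str, config: dict) -> str:
    """Decode &nbsp; according to the hypothetical translate_nbsp attribute."""
    # translate_nbsp is NOT a real ht://Dig attribute; it is assumed here.
    as_space = config.get("translate_nbsp", "false").lower() in ("true", "yes", "1")
    replacement = " " if as_space else "\u00a0"
    return re.sub(r"&nbsp;?", replacement, text)


conf = parse_config("locale: fr_FR\ntranslate_nbsp: true\n")
assert decode_nbsp("Neuilly&nbsp;Gestion", conf) == "Neuilly Gestion"
assert decode_nbsp("Neuilly&nbsp;Gestion", {}) == "Neuilly\u00a0Gestion"
```

With a default of false, users who want non-breaking spaces preserved in
excerpts would be unaffected, which is the concern raised above.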

I agree that "the whole range of configuration attributes is very confusing
to new users".
So much work has been done, and so many attributes added, since the beginning!
But maybe there are other answers to that problem than no longer adding
attributes.
Why not reorganize the attributes? You will see (below) that ht://Dig is not
as difficult as you think with a better organisation of the attributes.

I am thinking of something like the following (I decided to classify all the
attributes; I did it for myself, but maybe it helps):

database_compression_level
database_uncoded_db_compatible

http_allow_virtual_host
http_authorization
http_http_proxy
http_http_proxy_exclude
http_nph
http_server_aliases
http_server_wait_time
http_timeout
http_user_agent

report_create_image_list
report_create_url_list

output_format_add_anchors_to_excerpt
output_format_build_select_lists
output_format_date_format
output_format_end_ellipses
output_format_end_highlight
output_format_excerpt_length
output_format_excerpt_show_top
output_format_iso_8601
output_format_match_method
output_format_matches_per_page
output_format_max_stars
output_format_maximum_pages
output_format_method_names
output_format_next_page_text
output_format_no_excerpt_show_top
output_format_no_excerpt_text
output_format_no_next_page_text
output_format_no_page_list_header
output_format_no_page_number_text
output_format_no_prev_page_text
output_format_nothing_found_file
output_format_no_title_text
output_format_page_list_header
output_format_page_number_separator
output_format_page_number_text
output_format_prefix_match_character
output_format_prev_page_text
output_format_script_name
output_format_search_results_footer
output_format_search_results_header
output_format_search_results_wrapper
output_format_sort
output_format_sort_names
output_format_star_blank
output_format_star_image
output_format_star_patterns
output_format_start_ellipses
output_format_start_highlight
output_format_substring_max_word
output_format_syntax_error_file
output_format_template_map
output_format_template_name
output_format_template_patterns
output_format_translate_amp
output_format_translate_lt_gt
output_format_translate_quot
output_format_use_star_image
output_format_version

pertinence_searching_allow_in_form
pertinence_searching_backlink_factor
pertinence_searching_date_factor
pertinence_searching_description_factor
pertinence_searching_heading_factor_1
pertinence_searching_heading_factor_2
pertinence_searching_heading_factor_3 [...]
pertinence_searching_keywords_factor
pertinence_searching_keyword_meta_tag_names
pertinence_searching_max_prefix_matches
pertinence_searching_meta_description_factor
pertinence_searching_minimum_prefix_length
pertinence_searching_search_algorithm
pertinence_searching_text_factor
pertinence_searching_title_factor
pertinence_searching_use_meta_description

self_configuration_bin_dir
self_configuration_common_dir
self_configuration_config_dir
self_configuration_database_dir
self_configuration_doc_db
self_configuration_doc_index
self_configuration_doc_list
self_configuration_endings_affix_file
self_configuration_endings_dictionary
self_configuration_endings_root2word_db
self_configuration_endings_word2root_db
self_configuration_htnotify_sender
self_configuration_image_list
self_configuration_image_url_prefix
self_configuration_include
self_configuration_maintainer
self_configuration_metaphone_db
self_configuration_soundex_db
self_configuration_synonym_dictionary
self_configuration_synonym_db
self_configuration_url_list
self_configuration_url_log
self_configuration_word_db
self_configuration_word_list

what_to_index_allow_numbers
what_to_index_bad_extensions
what_to_index_bad_querystr
what_to_index_bad_word_list
what_to_index_case_sensitive
what_to_index_exclude_urls
what_to_index_external_parsers
what_to_index_extra_word_characters
what_to_index_limit_urls_to
what_to_index_local_default_doc
what_to_index_local_urls
what_to_index_local_urls_only
what_to_index_local_user_urls
what_to_index_locale
what_to_index_max_description_length
what_to_index_max_doc_size
what_to_index_max_head_length
what_to_index_max_hop_count
what_to_index_max_keyword
what_to_index_max_meta_description_length
what_to_index_maximum_word_length
what_to_index_minimum_word_length
what_to_index_modification_time_is_now
what_to_index_noindex_start
what_to_index_noindex_end
what_to_index_pdf_parser
what_to_index_remove_bad_urls
what_to_index_remove_default_doc
what_to_index_robotstxt_name
what_to_index_server_max_docs
what_to_index_start_url
what_to_index_url_part_aliases
what_to_index_valid_extensions
what_to_index_valid_punctuation

??_build_select_list
??limit_normalized
??logging

Of course, we could certainly come up with a better classification.
With such a reorganization, newbies would understand all the attributes more
easily, and you would see that there are not so many attributes... I
understand that renaming the attributes is a big change. So you could change
only:
 -- the documentation: you could add an "Attributes / By type" page
(not much work with a database; only one field to add).
 -- the default htdig.conf: for example, you could list all the attributes
ordered by type, most of them commented out with a "#", so that the user sees
which choices exist for each kind of need.
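
As a rough illustration of both points, the Python sketch below adds one
"type" field per attribute and derives a by-type listing and a commented-out
htdig.conf skeleton from it. The table holds only a handful of attributes
from the list above, the category names are the ones proposed in this
message, and the function names are invented for this sketch; none of this
is part of ht://Dig.

```python
from collections import defaultdict

# Hypothetical "type" field for a few real attribute names (sample only).
CATEGORIES = {
    "locale": "what_to_index",
    "start_url": "what_to_index",
    "http_proxy": "http",
    "timeout": "http",
    "excerpt_length": "output_format",
    "database_dir": "self_configuration",
}


def group_by_type(categories: dict) -> dict:
    """Return {category: sorted attribute names}, categories in sorted order."""
    groups = defaultdict(list)
    for name, category in categories.items():
        groups[category].append(name)
    return {category: sorted(groups[category]) for category in sorted(groups)}


def commented_config(categories: dict) -> str:
    """Build a config skeleton with every attribute commented out by '#'."""
    lines = []
    for category, names in group_by_type(categories).items():
        lines.append(f"# ---- {category} ----")
        lines.extend(f"#{name}:" for name in names)
        lines.append("")
    return "\n".join(lines)


print(commented_config(CATEGORIES))
```

The attribute names themselves stay unchanged here; only the grouping is
new, which avoids the "big change" of renaming anything.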

What do you think about it?

Best regards, and many thanks for your help!
Charles Népote.

> --
> Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



This archive was generated by hypermail 2b28 : Tue May 16 2000 - 06:22:09 PDT