htdig: Skipping parts of a document


Marjolein Katsma (webmaster@javawoman.com)
Tue, 12 Jan 1999 07:30:49 +0100


Sometimes it's useful not to index parts of a document. Some examples:
- When using anchors in the search results to jump to the appropriate part
in the text (see a previous contribution), jumping to a top-of-page menu is
hardly relevant;
- Code (maybe produced by an external server) that changes faster than the
indexing cycle, for instance daily news headlines;
- Deleted text (<DEL></DEL>, HTML 4.0).

This patch lets you place start and end markers in the text so that
anything in between is not indexed. Existing tags can also serve as markers
(for instance <DEL> and </DEL> for my last example!). The default markers are
<!--htdig_noindex--> and <!--/htdig_noindex-->. The corresponding config
parameters are noindex_start and noindex_end.
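
As an illustration, here is how the two parameters might be set in a
configuration file to reuse the <DEL> tags instead of the default comment
markers. This is only a sketch: it assumes the usual "name: value" syntax of
htdig.conf, and it assumes the pages use exactly this spelling of the tags,
since the patch compares the markers literally (case-sensitive, no
attributes).

```text
# Skip everything between <DEL> and </DEL> when indexing
noindex_start:  <DEL>
noindex_end:    </DEL>
```

With the defaults left alone, the same effect is obtained by wrapping the
section to be skipped in <!--htdig_noindex--> and <!--/htdig_noindex-->.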

Two patches:

(1)
defaults.cc - this patch is against release 3.1.0b4 and is a
*replacement* for the previously posted patch (just a correction to a
comment though); it contains a few other changes needed for other features.

diff -c3p defaults.cc defaultsMK.cc
*** defaults.cc Tue Dec 22 18:53:12 1998
--- defaultsMK.cc Mon Jan 11 10:41:35 1999
***************
*** 3,8 ****
--- 3,22 ----
  //
  // default values for the ht programs
  //
+ // Revision 1999-01-11 mkatsma
+ // Added options translate_amp, translate_lt_gt and translate_quot to enable
+ // configuration of whether or not entities for '&', '<', '>' and '"' will
+ // be translated. The default is true, leaving the normal operation of htdig
+ // unchanged.
+ //
+ // Revision 1999-01-10 mkatsma
+ // Implemented configurable 'no title' text (found on mail list archive)
+ //
+ // Revision 1999-01-06 mkatsma
+ // Added options noindex_start and noindex_end to enable NOT indexing
+ // some sections of code; useful to exclude such things as local page menus
+ // and server-generated code that changes faster than an indexing cycle.
+ //
  // $Log: defaults.cc,v $
  // Revision 1.24 1998/12/11 02:49:54 ghutchis
  // Added option for server_max_docs as a limit on the number of docs returned
*************** ConfigDefaults defaults[] =
*** 168,173 ****
--- 182,190 ----
      {"no_excerpt_show_top", "false"},
      {"no_next_page_text", "[next]"},
      {"no_prev_page_text", "[prev]"},
+     {"no_title_text", "[No title]"},               //mk19990110
+     {"noindex_start", "<!--htdig_noindex-->"},     //mk19990106
+     {"noindex_end", "<!--/htdig_noindex-->"},      //mk19990106
      {"nothing_found_file", "${common_dir}/nomatch.html"},
      {"page_list_header", "<hr noshade size=2>Pages:<br>"},
      {"prefix_match_character", "*"},
*************** ConfigDefaults defaults[] =
*** 195,200 ****
--- 212,220 ----
      {"text_factor", "1"},
      {"timeout", "30"},
      {"title_factor", "100"},
+     {"translate_amp", "false"},                    //mk19990111
+     {"translate_lt_gt", "false"},                  //mk19990111
+     {"translate_quot", "false"},                   //mk19990111
      {"url_list", "${database_base}.urls"},
      {"use_star_image", "true"},
      {"use_meta_description", "false"},

(2)
Patch to HTML.cc - this is a diff against my previous version, the one with
the modified comment-skipping algorithm (see previous post):

javawoman: {10} % diff -c3p HTMLcommentMK.cc HTMLMK.cc
*** HTMLcommentMK.cc Mon Jan 11 22:46:49 1999
--- HTMLMK.cc Mon Jan 11 23:21:31 1999
***************
*** 3,8 ****
--- 3,12 ----
  //
  // Implementation of HTML
  //
+ // Revision 1999-01-09 mkatsma
+ // Added algorithm to skip text between configurable markers so it will
+ // not be indexed.
+ //
  // Revision 1999-01-07/1999-01-09 mkatsma
  // Modification of comment-filtering algorithm so it skips all legal SGML
  // comment declarations, including ones with whitespace after the last
*************** HTML::parse(Retriever &retriever, URL &b
*** 188,193 ****
--- 192,213 ----

      while (*position)
      {
+
+       //
+       // Filter out section marked to be ignored for indexing.
+       // This can contain any HTML.
+       //
+       char *skip_start = config["noindex_start"];
+       char *skip_end = config["noindex_end"];
+       if (strncmp((char *)position, skip_start, strlen(skip_start)) == 0)
+       {
+           q = (unsigned char*)strstr((char *)position, skip_end);
+           if (!q)
+               *position = '\0';   // Rest of document will be skipped...
+           else
+               position = q + strlen(skip_end);
+           continue;
+       }

        // Improved algorithm 1999-01-07 Marjolein Katsma
        // (with help from Gilles Detillieux)
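
For anyone who wants to try the skipping logic outside of htdig, the sketch
below reproduces the same scan on a plain string. It is not part of the
patch: the function name strip_noindex is hypothetical, it uses std::string
instead of htdig's own classes, and it adds a guard for empty markers that
the patch itself does not have (an empty start marker would match at every
position).

```cpp
#include <cstring>
#include <string>

// Hypothetical standalone helper, not part of ht://Dig: returns the text
// that is left for the indexer once the marked sections are skipped.
std::string strip_noindex(const std::string &doc,
                          const std::string &skip_start,
                          const std::string &skip_end)
{
    // Guard added for this sketch only: with an empty marker there is
    // nothing sensible to skip, so return the document unchanged.
    if (skip_start.empty() || skip_end.empty())
        return doc;

    std::string kept;
    const char *position = doc.c_str();
    while (*position)
    {
        // Same test as in the patch: does the start marker begin here?
        if (std::strncmp(position, skip_start.c_str(), skip_start.size()) == 0)
        {
            const char *q = std::strstr(position, skip_end.c_str());
            if (!q)
                break;                       // no end marker: skip the rest
            position = q + skip_end.size();  // resume after the end marker
            continue;
        }
        kept += *position++;                 // ordinary character: keep it
    }
    return kept;
}
```

Called with the default markers on a page whose menu is wrapped in
<!--htdig_noindex--> and <!--/htdig_noindex-->, this returns the page text
without the menu. As in the patch, a start marker without a matching end
marker causes the rest of the document to be dropped.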

Marjolein Katsma webmaster@javawoman.com
Java Woman - http://javawoman.com/


