Re: [htdig] 3.1.1: Does noindex_start, noindex_stop work?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 17 Mar 1999 17:13:05 -0600 (CST)


According to me, back in late February...
> According to Frank Richter:
> > Then I had by mistake an empty noindex_start: value in the conf file, oh
> > dear, no words were indexed at all (my error, but might be dangerous for
> > others too).
>
> Yes, you're right. The code should check for an empty string, and disable
> the feature if that's the case. Right now, it just does a strncmp()
> with a length of 0, which will always match. I think this should also
> use mystrncasecmp() instead, and mystrcasestr() to find the end, so that
> it won't care if the tags are upper or lower case. Objections?

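Well, I didn't hear any objections, so here's the patch to make these fixes
to htdig/HTML.cc. It also fixes the discrepancies in the documentation,
which called the attribute noindex_stop rather than noindex_end and gave
the default markers with a hyphen instead of an underscore.
I'll be committing these to CVS shortly.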
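
For anyone who wants to see the empty-string problem in isolation, here is
a small standalone sketch. It is not the htdig code: the two function names
are made up for illustration, and POSIX strncasecmp() stands in for htdig's
own mystrncasecmp(), which I'm assuming behaves the same way for this
purpose.

```c
#include <string.h>
#include <strings.h>   /* POSIX strncasecmp(), a stand-in for mystrncasecmp() */

/* Old test (hypothetical name): with an empty marker, strlen() is 0 and
   strncmp() compares zero characters, so this matches at every position. */
static int starts_skip_old(const char *position, const char *skip_start)
{
    return strncmp(position, skip_start, strlen(skip_start)) == 0;
}

/* Patched test (hypothetical name): an empty marker disables the feature,
   and the comparison ignores upper/lower case. */
static int starts_skip_new(const char *position, const char *skip_start)
{
    return *skip_start &&
           strncasecmp(position, skip_start, strlen(skip_start)) == 0;
}
```

With an empty marker the old test is true everywhere, which is why no words
got indexed at all; the new test is simply never true in that case.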

--- ./htdig/HTML.cc.skipendbug Wed Mar 17 16:11:52 1999
+++ ./htdig/HTML.cc Wed Mar 17 17:05:15 1999
@@ -125,9 +125,10 @@
       // Filter out section marked to be ignored for indexing.
       // This can contain any HTML.
       //
-      if (strncmp((char *)position, skip_start, strlen(skip_start)) == 0)
+      if (*skip_start &&
+          mystrncasecmp((char *)position, skip_start, strlen(skip_start)) == 0)
         {
-          q = (unsigned char*)strstr((char *)position, skip_end);
+          q = (unsigned char*)mystrcasestr((char *)position, skip_end);
           if (!q)
             *position = '\0'; // Rest of document will be skipped...
           else
--- ./htdoc/attrs.html.skipendbug Tue Feb 16 23:03:53 1999
+++ ./htdoc/attrs.html Wed Mar 17 16:21:55 1999
@@ -3433,7 +3433,7 @@
         <dl>
           <dt>
                 <strong><a name="noindex_start">noindex_start</a>,
-                <a name="noindex_stop">noindex_stop</a></strong>
+                <a name="noindex_end">noindex_end</a></strong>
           </dt>
           <dd>
                 <dl>
@@ -3453,7 +3453,7 @@
                         <em>default:</em>
                   </dt>
                   <dd>
-                        &lt;!--htdig-noindex--&gt; &lt;!--/htdig-noindex--&gt;
+                        &lt;!--htdig_noindex--&gt; &lt;!--/htdig_noindex--&gt;
                   </dd>
                   <dt>
                         <em>description:</em>
@@ -3468,14 +3468,14 @@
                         SCRIPT sections in 'uneditable' documents can be skipped; note how
                         noindex_start does not contain an ending &gt;: this allows for all SCRIPT
                         tags to be matched regardless of attributes defined (different types or
-                        languages).
+                        languages). Note that the match for this string is case insensitive.
                   </dd>
                   <dt>
                         <em>example:</em>
                   </dt>
                   <dd>
                         noindex_start: &lt;SCRIPT<br>
-                        noindex_stop: &lt;/SCRIPT&gt;
+                        noindex_end: &lt;/SCRIPT&gt;
                   </dd>
                 </dl>
           </dd>
--- ./htdoc/cf_byname.html.skipendbug Tue Feb 16 23:03:54 1999
+++ ./htdoc/cf_byname.html Wed Mar 17 16:22:47 1999
@@ -105,8 +105,8 @@
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#next_page_text">next_page_text</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_excerpt_text">no_excerpt_text</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_excerpt_show_top">no_excerpt_show_top</a><br>
+          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_end">noindex_end</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_start">noindex_start</a><br>
-          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_stop">noindex_stop</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_next_page_text">no_next_page_text</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_page_list_header">no_page_list_header</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_page_number_text">no_page_number_text</a><br>
--- ./htdoc/cf_byprog.html.skipendbug Tue Feb 16 23:03:54 1999
+++ ./htdoc/cf_byprog.html Wed Mar 17 16:23:10 1999
@@ -56,8 +56,8 @@
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#meta_description_factor">meta_description_factor</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_word_length">minimum_word_length</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#modification_time_is_now">modification_time_is_now</a><br>
+          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_end">noindex_end</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_start">noindex_start</a><br>
-          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_stop">noindex_stop</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#pdf_parser">pdf_parser</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#remove_default_doc">remove_default_doc</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#robotstxt_name">robotstxt_name</a><br>
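
To show what the case-insensitive start/end matching amounts to, here is a
rough standalone sketch of the skip step. The names find_nocase() and
skip_noindex() are hypothetical; find_nocase() is only a stand-in for
htdig's mystrcasestr(), and resuming just past the end marker is my
assumption, since that part of HTML.cc is outside the hunk above.

```c
#include <string.h>
#include <strings.h>   /* POSIX strncasecmp() */

/* Hypothetical stand-in for mystrcasestr(): first case-insensitive
   occurrence of needle in haystack, or NULL if there is none. */
static char *find_nocase(char *haystack, const char *needle)
{
    size_t n = strlen(needle);
    for (; *haystack; haystack++)
        if (strncasecmp(haystack, needle, n) == 0)
            return haystack;
    return NULL;
}

/* Returns where parsing would resume. If position does not begin with
   skip_start (or skip_start is empty), nothing is skipped. */
static char *skip_noindex(char *position, const char *skip_start,
                          const char *skip_end)
{
    char *q;

    if (!*skip_start ||
        strncasecmp(position, skip_start, strlen(skip_start)) != 0)
        return position;          /* feature disabled, or no marker here */

    q = find_nocase(position, skip_end);
    if (!q) {
        *position = '\0';         /* rest of document will be skipped */
        return position;
    }
    return q + strlen(skip_end);  /* assumption: resume after the end marker */
}
```

So with noindex_start set to "<SCRIPT" and noindex_end set to "</SCRIPT>",
a lower-case <script ...> ... </script> block is skipped as well.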

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930