htdig: Patch: support META elements for external parsers.


Hans-Peter Nilsson (hans-peter.nilsson@axis.com)
Thu, 14 Jan 1999 03:50:46 +0100


Here's an implementation of META elements for external parsers;
the record type 'm' is used for this. Nothing really new; most of
it was lifted from HTML.cc (no, I could not find a good way to
share that code within reasonable limits).
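
For anyone writing an external parser, here is a minimal sketch of how such a record might be emitted. It assumes the layout read by the patch: the record type 'm', then the http-equiv, name and content fields, tab-separated and newline-terminated. The helper names (make_meta_record, clean_field) are made up for illustration, and replacing stray tabs and newlines with spaces is an assumption of this sketch, not something the patch does.

```cpp
#include <string>

// Replace characters that would break the record framing (tab, CR, LF)
// with a space. Hypothetical helper; the sanitising is an assumption.
static std::string clean_field(std::string s)
{
    for (char &c : s)
        if (c == '\t' || c == '\n' || c == '\r')
            c = ' ';
    return s;
}

// Build one 'm' record: m<TAB>http-equiv<TAB>name<TAB>content<NL>.
// Any of the three fields may be empty; all three tabs are still written
// so that the reader sees three fields.
std::string make_meta_record(const std::string &httpEquiv,
                             const std::string &name,
                             const std::string &content)
{
    return "m\t" + clean_field(httpEquiv) + "\t" + clean_field(name)
         + "\t" + clean_field(content) + "\n";
}
```

A parser would write the returned string to its standard output, for example make_meta_record("", "keywords", "search, index") for a keywords tag with no HTTP-EQUIV attribute.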

Note that meta.html is not up-to-date (regardless of this).
I did not fix that; I see it as a bug that can be fixed during
the feature-freeze (schemes within schemes :-)

Thu Jan 14 03:16:15 1999 Hans-Peter Nilsson <hp@axis.se>

        * htdig/ExternalParser.cc (parse): Added support for 'm': meta
        element.
        * htdoc/attrs.html: Document it.

Index: htdig/ExternalParser.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/ExternalParser.cc,v
retrieving revision 1.4
diff -p -c -r1.4 ExternalParser.cc
*** ExternalParser.cc 1998/12/06 18:46:59 1.4
--- ExternalParser.cc 1999/01/14 02:36:20
*************** static char RCSid[] = "$Id: ExternalPars
*** 30,35 ****
--- 30,36 ----
  #include <Dictionary.h>
  #include <ctype.h>
  #include <stdio.h>
+ #include <good_strtok.h>
  
  static Dictionary *parsers = 0;
  extern String configFile;
*************** ExternalParser::parse(Retriever &retriev
*** 153,158 ****
--- 154,162 ----
          return;
      }
  
+ unsigned int minimum_word_length
+ = config.Value("minimum_word_length", 3);
+
      String line;
      char *token1, *token2, *token3;
      URL url;
*************** ExternalParser::parse(Retriever &retriev
*** 209,214 ****
--- 213,328 ----
                  token1 = strtok(0, "\t");
                  if (token1 != NULL)
                    retriever.got_image(token1);
+ else
+ cerr<< "External parser error in line:"<<line<<"\n";
+ break;
+ case 'm': // meta
+ // Using good_strtok means we can accept empty
+ // fields.
+ char *httpEquiv = good_strtok(token1+2, "\t");
+ char *name = good_strtok(0, "\t");
+ char *content = good_strtok(0, "\t");
+
+ if (httpEquiv != NULL && name != NULL && content != NULL)
+ {
+ // It would be preferable if we could share
+ // this part with HTML.cc, but it has other
+ // chores too, and I do not see a point where to
+ // split it up to get a common shared function
+ // (or class). Which should not stop anybody from
+ // finding a better solution.
+ // For now, there is duplicated code.
+ StringMatch keywordsMatch;
+ String keywordNames = config["keywords_meta_tag_names"];
+
+ keywordNames.replace(' ', '|');
+ keywordNames.remove(",\t\r\n");
+ keywordsMatch.IgnoreCase();
+ keywordsMatch.Pattern(keywordNames);
+
+ // <URL:http://www.w3.org/MarkUp/html-spec/html-spec_5.html#SEC5.2.5>
+ // says that the "name" attribute defaults to
+ // the http-equiv attribute if empty.
+ if (*name == '\0')
+ name = httpEquiv;
+
+ if (*httpEquiv != '\0')
+ {
+ // <META HTTP-EQUIV=REFRESH case
+ if (mystrcasecmp(httpEquiv, "refresh") == 0
+ && *content != '\0')
+ {
+ char *q = mystrcasestr(content, "url=");
+ if (q && *q)
+ {
+ q += 4; // skipping "URL="
+ char *qq = q;
+ while (*qq && (*qq != ';') && (*qq != '"') &&
+ !isspace(*qq))qq++;
+ *qq = 0;
+ URL href(q, base);
+ // I don't know why anyone would do this, but hey...
+ retriever.got_href(href, "");
+ }
+ }
+ }
+
+ //
+ // Now check for <meta name=... content=...> tags that
+ // fly with any reasonable DTD out there
+ //
+ if (*name != '\0' && *content != '\0')
+ {
+ if (keywordsMatch.CompareWord(name))
+ {
+ char *w = strtok(content, " ,\t\r");
+ while (w)
+ {
+ if (strlen(w) >= minimum_word_length)
+ retriever.got_word(w, 1, 10);
+ w = strtok(0, " ,\t\r");
+ }
+ }
+ else if (mystrcasecmp(name, "htdig-email") == 0)
+ {
+ retriever.got_meta_email(content);
+ }
+ else if (mystrcasecmp(name, "htdig-notification-date") == 0)
+ {
+ retriever.got_meta_notification(content);
+ }
+ else if (mystrcasecmp(name, "htdig-email-subject") == 0)
+ {
+ retriever.got_meta_subject(content);
+ }
+ else if (mystrcasecmp(name, "description") == 0
+ && strlen(content) != 0)
+ {
+ //
+ // We need to do two things. First grab the description
+ //
+ String meta_dsc = content;
+
+ if (meta_dsc.length() > max_meta_description_length)
+ meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
+ if (debug > 1)
+ cout << "META Description: " << content << endl;
+ retriever.got_meta_dsc(meta_dsc);
+
+ //
+ // Now add the words to the word list
+ // (slot 11 is the new slot for this)
+ //
+ char *w = strtok(content, " \t\r");
+ while (w)
+ {
+ if (strlen(w) >= minimum_word_length)
+ retriever.got_word(w, 1, 11);
+ w = strtok(0, " \t\r");
+ }
+ }
+ }
+ }
                  else
                    cerr<< "External parser error in line:"<<line<<"\n";
                  break;
Index: htdoc/attrs.html
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdoc/attrs.html,v
retrieving revision 1.15
diff -p -c -r1.15 attrs.html
*** attrs.html 1999/01/14 01:19:25 1.15
--- attrs.html 1999/01/14 02:36:25
***************
*** 1277,1284 ****
              The external parser is to write information for
              htdig on its standard output.<br>
               The output consists of records, each record terminated
! with a newline. Each record is a series of non-empty tab
! separated fields. The first field is a single character
              that specifies the record type. The rest of the fields
              are determined by the record type.
              <table border="1">
--- 1277,1285 ----
              The external parser is to write information for
              htdig on its standard output.<br>
               The output consists of records, each record terminated
! with a newline. Each record is a series of (unless
! expressly allowed to be empty) non-empty tab-separated
! fields. The first field is a single character
              that specifies the record type. The rest of the fields
              are determined by the record type.
              <table border="1">
***************
*** 1467,1472 ****
--- 1468,1504 ----
                    the document.
                  </td>
                </tr>
+ <tr>
+ <th rowspan="3" valign="top">
+ m
+ </th>
+ <td valign="top">
+ http-equiv
+ </td>
+ <td>
+ The HTTP-EQUIV attribute of a <a
+ href="meta.html"><i>META</i> tag</a>.
+ May be empty.
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">
+ name
+ </td>
+ <td>
+ The NAME attribute of this <i>META</i>
+ tag</a>. May be empty.
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">
+ contents
+ </td>
+ <td>
+ The CONTENT attribute of this <i>META</i> tag.
+ May be empty.
+ </td>
+ </tr>
              </table>
            </dd>
            <dt>
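
The 'm' case uses good_strtok because plain strtok treats a run of adjacent delimiters as one, so an empty http-equiv or name field would silently vanish and the remaining fields would shift. The sketch below illustrates the field-preserving behaviour with std::string; split_keep_empty is a hypothetical stand-in written for this note, not htdig's good_strtok.

```cpp
#include <string>
#include <vector>

// Split `record` on `sep`, keeping empty fields between adjacent
// separators. Hypothetical helper, for illustration only.
std::vector<std::string> split_keep_empty(const std::string &record, char sep)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;)
    {
        std::string::size_type end = record.find(sep, start);
        if (end == std::string::npos)
        {
            // Last field: everything after the final separator.
            fields.push_back(record.substr(start));
            break;
        }
        fields.push_back(record.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}
```

Splitting "m<TAB><TAB>keywords<TAB>htdig" this way gives four fields, the second of which is empty, which is the case the new record type has to accept.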

brgds, H-P