Re: [htdig3-dev] 3.2 goals (was On vacation this weekend)


Joe R. Jah (jjah@cloud.ccsf.cc.ca.us)
Tue, 3 Aug 1999 21:22:24 -0700 (PDT)


On Tue, 3 Aug 1999, Geoff Hutchison wrote:

> Date: Tue, 3 Aug 1999 22:08:49 -0400 (EDT)
> From: Geoff Hutchison <ghutchis@wso.williams.edu>
> To: htdig3-dev@htdig.org
> Cc: htdig3-dev@htdig.org
> Subject: Re: [htdig3-dev] 3.2 goals (was On vacation this weekend)
>
>
> On Tue, 3 Aug 1999, Gilles Detillieux wrote:
>
> > When I e-mailed Mike Grommet about this a couple weeks ago, he said the
> > last patch he posted to the list was complete enough for his needs. I
>
> OK, it seems like there are some patches that need to be dug up and dusted
> off. I'm glad to merge in whatever is sent to me, but the patch queue is
> pretty slim right now. (That means if you haven't seen it so far, send it
> again.)

The forwarded message was sent by Mike Grommet in April.

Joe

-- 
     _/   _/_/_/       _/              ____________    __o
     _/   _/   _/      _/         ______________     _-\<,_
 _/  _/   _/_/_/   _/  _/                     ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        jjah@cloud.ccsf.cc.ca.us

---------- Forwarded message ----------
Date: Wed, 7 Apr 1999 15:04:51 -0500
From: mike grommet <mgrommet@insolwwb.net>
To: htdig3-dev@htdig.org
Cc: htdig3-dev@htdig.org
Subject: RE: [htdig3-dev] either a bug, or my ignorance, both are definately possible :)

Geoff, I've run into some interesting details here. I'm going to forward a copy of this to the list as well. Sorry for the long message, but I wasn't quite sure how else to get the info across.

I thought I might fill you in with some details...

As the original note below would indicate, I was getting some really funky date information.

The major problem I am having occurs with the addition of Retriever::got_time(char *time) in the patch file you sent...

here is the routine you sent me:

Retriever::got_time(char *time)
{
    time_t      new_time;
    struct tm   tm;

    if (debug > 1)
        cout << "\ntime: " << time << endl;

    //
    // As defined by the Dublin Core, this should be YYYY-MM-DD
    // In the future, we'll need to deal with the scheme portion
    // in case someone picks a different format.
    //
    if (mystrptime(time, "%Y-%m-%d", &tm))
    {
#if HAVE_TIMEGM
        new_time = timegm(&tm);
#else
        new_time = mytimegm(&tm);
#endif
        current_time = new_time;
    }

    // If we can't convert it, current_time stays the same and we get
    // the default--the date returned by the server...
}

Ok, I hacked in some debug output to echo out when debug options are on... here are the details (it should be noted that the date in the meta tag is 2001-04-05). I am going through and printing out the individual fields of tm, followed by the resulting time value:

time: 2001-04-05         - this is from your original code
hour: 1                  - why 1? doesn't really matter though
min: 0                   - checks fine here
sec: 134799364           - WOW! that's a lot of seconds.
month: 3                 - this one is right
day: 5                   - here too
year: 101                - just fine here
othertime: 1121145364    ---- translates to sometime in July 2005 I think

Ok, well the seconds info just screams out at me, so I've hacked the code a bit to initialize the struct tm fields... here is my final routine (so far) with my debug code too:

void
Retriever::got_time(char *time)
{
    time_t      new_time;
    struct tm   tm;

    // added by me
    tm.tm_hour = 0;
    tm.tm_min = 0;
    tm.tm_sec = 0;
    tm.tm_mon = 0;
    tm.tm_mday = 1;
    tm.tm_year = 0;

    if (debug > 1)
        cout << "\ntime: " << time << endl;

    //
    // As defined by the Dublin Core, this should be YYYY-MM-DD
    // In the future, we'll need to deal with the scheme portion
    // in case someone picks a different format.
    //
    if (mystrptime(time, "%Y-%m-%d", &tm))
    {
        if (debug > 1)
        {
            cout << "\nhour: " << tm.tm_hour << endl;
            cout << "\nmin: " << tm.tm_min << endl;
            cout << "\nsec: " << tm.tm_sec << endl;
            cout << "\nmonth: " << tm.tm_mon << endl;
            cout << "\nday: " << tm.tm_mday << endl;
            cout << "\nyear: " << tm.tm_year << endl;
        }
//#if HAVE_TIMEGM
        new_time = timegm(&tm);
//#else
//      new_time = mytimegm(&tm);
//#endif
        current_time = new_time;
        if (debug > 1)
            cout << "\nothertime: " << current_time << endl;
    }

    // If we can't convert it, current_time stays the same and we get
    // the default--the date returned by the server...
}

Ok, and now here is the new output:

time: 2001-04-05

hour: 0

min: 0

sec: 0

month: 3

day: 5

year: 101

othertime: 986428800
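
The change in output is what one would expect if the parser fills in only the fields named in the format string: with "%Y-%m-%d" the hour, minute, and second fields keep whatever happened to be on the stack unless they are cleared first. A minimal sketch of the same idea follows. It assumes plain sscanf in place of mystrptime, and the name parse_iso_date is hypothetical, not part of ht://Dig or of the patch.

```cpp
#include <cstdio>
#include <ctime>

// Hypothetical helper (not part of ht://Dig): parse "YYYY-MM-DD" into a
// struct tm whose remaining fields are all zero.
// Returns true on success, false if the text does not match.
bool parse_iso_date(const char *text, struct tm *out)
{
    // Value-initialize so tm_hour, tm_min, tm_sec, tm_isdst and any
    // platform-specific fields start at zero instead of stack garbage.
    struct tm parsed = {};

    int year = 0, month = 0, day = 0;
    if (!text || std::sscanf(text, "%4d-%2d-%2d", &year, &month, &day) != 3)
        return false;
    if (month < 1 || month > 12 || day < 1 || day > 31)
        return false;

    parsed.tm_year = year - 1900;   // struct tm counts years from 1900
    parsed.tm_mon  = month - 1;     // and months from 0
    parsed.tm_mday = day;

    *out = parsed;
    return true;
}
```

Clearing the whole struct in one step, rather than field by field, also covers members such as tm_isdst that the hand-written assignments leave untouched.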

Note, in my code above, I have disabled the mytimegm. Strangely enough, timegm and mytimegm do NOT return the same values, they are exactly 24 hours off from one another.
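
One way to see which of the two values is right is to compute the UTC timestamp by hand. The sketch below uses a well-known days-from-civil calculation; it is an independent illustration, not the code of timegm or mytimegm, and the function names are hypothetical. For 2001-04-05 it gives 11417 days since 1970-01-01, or 986428800 seconds, which matches the timegm result shown in the output. The earlier value 1121145364 equals 986342400 + 3600 + 134799364, so that run appears to have started from a base one day earlier; the message does not say which branch produced it.

```cpp
#include <ctime>

// Days between 1970-01-01 and the given proleptic Gregorian date.
long long days_from_civil(int year, int month, int day)
{
    // Treat January and February as months 10 and 11 of the previous year
    // so the leap day falls at the end of the shifted year.
    if (month <= 2)
        year -= 1;
    const int era = (year >= 0 ? year : year - 399) / 400;
    const int year_of_era = year - era * 400;                        // 0..399
    const int shifted_month = month > 2 ? month - 3 : month + 9;     // Mar=0 .. Feb=11
    const int day_of_year = (153 * shifted_month + 2) / 5 + day - 1; // 0..365
    const int day_of_era = year_of_era * 365 + year_of_era / 4
                         - year_of_era / 100 + day_of_year;          // 0..146096
    return static_cast<long long>(era) * 146097 + day_of_era - 719468;
}

// Seconds since the epoch for a struct tm interpreted as UTC.
// Assumes the fields are already in their normal ranges.
long long utc_seconds_from_tm(const struct tm &t)
{
    const long long days =
        days_from_civil(t.tm_year + 1900, t.tm_mon + 1, t.tm_mday);
    return days * 86400LL + t.tm_hour * 3600LL + t.tm_min * 60LL + t.tm_sec;
}
```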

Which brings me to my second and more minor problem: when the list of results is displayed, instead of treating the time as UTC, as the rest of my code for search ranges does already, it seems that the code for outputting date information on search results:

    if (t)
    {
        struct tm *tm = localtime(&t);
        // strftime(buffer, sizeof(buffer), "%e-%h-%Y", tm);
        if (config.Boolean("iso_8601"))
        {
            strftime(buffer, sizeof(buffer), "%Y-%m-%d %H:%M:%S %Z", tm);
        }
        else
        {
            strftime(buffer, sizeof(buffer), "%D", tm);
        }
        *str << buffer;
    }

is using localtime instead. Just wondering if that should be the default or not... it's definitely a problem with my particular implementation, since date ranges are searched using UTC time, so the date appears short by one day. If you think I should change my current search routines so that time zones are taken into account, I can probably do that, but I'm not sure exactly where to begin.
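
The one-day shift follows from the stored value being midnight UTC: in any time zone west of Greenwich, that instant is still the previous evening, so localtime reports the day before. A minimal sketch of formatting the same value with gmtime is shown here. The helper name format_utc_date is hypothetical, and this is not a proposed change to the ht://Dig display code.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: format a time_t as YYYY-MM-DD, interpreted as UTC.
// Returns an empty string if the conversion fails.
std::string format_utc_date(std::time_t t)
{
    char buffer[32];
    const struct tm *utc = std::gmtime(&t);   // gmtime rather than localtime
    if (!utc || std::strftime(buffer, sizeof(buffer), "%Y-%m-%d", utc) == 0)
        return std::string();
    return std::string(buffer);
}
```

With the value from the debug output, format_utc_date(986428800) yields "2001-04-05" regardless of the machine's time zone setting.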

What do you think?

-----Original Message-----
From: Mike Grommet [mailto:mgrommet@insolwwb.net]
Sent: Tuesday, April 06, 1999 4:22 PM
To: 'Geoff Hutchison'
Subject: RE: [htdig3-dev] Search by date ranges: some success, a few more

Geoff, I patched my source, and am getting some really really weird results...

Ok, for instance, look at this link:

http://www.weaselweb.com/viewarchivenewssports.php3?db=newsarchive&idnum=1

Ok, if you view the source, you will see the meta tag for the date. Piece of cake. Now, when I run htdig with debug codes, it echoes out the right date, but it's only the contents of the meta name="date" tag, and nothing has been performed on it yet. The content is in the proper format, as far as I can tell.

Ok, now that you have seen this, go to this address

http://omega.insolwwb.net/htdig

Search the news archive (on the left) and for keywords, enter arkansas.

You don't have to bother with a search range.

Ok, it will bring up 1 document. Look at the date on the document:

http://www.weaselweb.com/viewarchivenewssports.php3?db=newsarchive&idnum=1 07/12/05, 6965 bytes

2005-07-12?????

Where in the world is this coming from?

-----Original Message-----
From: Geoff Hutchison [mailto:ghutchis@wso.williams.edu]
Sent: Tuesday, April 06, 1999 2:31 PM
To: mike grommet
Subject: RE: [htdig3-dev] Search by date ranges: some success, a few more

On Tue, 6 Apr 1999, mike grommet wrote:

> My thoughts are to take a meta tag, named something like "Document-date" and
> a value just like the standard GMT time returned by a web server for a Last
> Modification

There is already a standard for this, specified by the Dublin Core standard. The tag is named "DATE" and has the ISO-8601 format YYYY-MM-DD.
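
Since the indexer only sees whatever string a page author put in the tag, it can help to check that the content really has the YYYY-MM-DD shape before converting it. The following is a rough sketch of such a check; the function is_valid_iso_date is hypothetical and does not exist in ht://Dig.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper: true if text is exactly YYYY-MM-DD and names a real
// calendar day under Gregorian leap-year rules.
bool is_valid_iso_date(const char *text)
{
    if (!text || std::strlen(text) != 10 || text[4] != '-' || text[7] != '-')
        return false;
    for (int i = 0; i < 10; ++i)
    {
        if (i == 4 || i == 7)
            continue;                       // the two separators
        if (!std::isdigit(static_cast<unsigned char>(text[i])))
            return false;
    }

    const int year  = (text[0] - '0') * 1000 + (text[1] - '0') * 100
                    + (text[2] - '0') * 10 + (text[3] - '0');
    const int month = (text[5] - '0') * 10 + (text[6] - '0');
    const int day   = (text[8] - '0') * 10 + (text[9] - '0');

    static const int days_in_month[12] =
        {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    if (month < 1 || month > 12)
        return false;
    const bool leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    const int limit = days_in_month[month - 1] + ((month == 2 && leap) ? 1 : 0);
    return day >= 1 && day <= limit;
}
```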

> Would you happen to have this code handy? It would be useful to me at least

Here you go... I should probably make this an option with something like 'use_doc_date' when I commit it.

Index: htdig/HTML.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/HTML.cc,v
retrieving revision 1.39
diff -c -3 -p -r1.39 HTML.cc
*** htdig/HTML.cc 1999/03/23 20:09:22 1.39
--- htdig/HTML.cc 1999/04/02 01:37:20
*************** HTML::do_tag(Retriever &retriever, Strin
*** 841,846 ****
--- 841,850 ----
          {
              retriever.got_meta_email(conf["content"]);
          }
+         else if (mystrcasecmp(cache, "date") == 0)
+         {
+             retriever.got_time(conf["content"]);
+         }
          else if (mystrcasecmp(cache, "htdig-notification-date") == 0)
          {
              retriever.got_meta_notification(conf["content"]);
Index: htdig/Retriever.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Retriever.cc,v
retrieving revision 1.39
diff -c -3 -p -r1.39 Retriever.cc
*** htdig/Retriever.cc 1999/03/16 02:04:28 1.39
--- htdig/Retriever.cc 1999/04/02 01:37:20
*************** Retriever::RetrievedDocument(Document &d
*** 543,548 ****
--- 543,549 ----
      current_ref = ref;
      current_anchor_number = 0;
      current_title = 0;
+     current_time = 0;
      current_head = 0;
      current_meta_dsc = 0;
  
*************** Retriever::RetrievedDocument(Document &d
*** 565,571 ****
      //
      ref->DocHead(current_head);
      ref->DocMetaDsc(current_meta_dsc);
!     ref->DocTime(doc.ModTime());
      ref->DocTitle(current_title);
      ref->DocSize(doc.Length());
      ref->DocAccessed(time(0));
--- 566,575 ----
      //
      ref->DocHead(current_head);
      ref->DocMetaDsc(current_meta_dsc);
!     if (current_time == 0)
!       ref->DocTime(doc.ModTime());
!     else
!       ref->DocTime(current_time);
      ref->DocTitle(current_title);
      ref->DocSize(doc.Length());
      ref->DocAccessed(time(0));
*************** Retriever::got_title(char *title)
*** 891,896 ****
--- 895,930 ----
      current_title = title;
  }
  
+ //*****************************************************************************
+ // void Retriever::got_time(char *time)
+ //
+ void
+ Retriever::got_time(char *time)
+ {
+     time_t      new_time;
+     struct tm   tm;
+ 
+     if (debug > 1)
+       cout << "\ntime: " << time << endl;
+ 
+     //
+     // As defined by the Dublin Core, this should be YYYY-MM-DD
+     // In the future, we'll need to deal with the scheme portion
+     // in case someone picks a different format.
+     //
+     if (mystrptime(time, "%Y-%m-%d", &tm))
+       {
+ #if HAVE_TIMEGM
+       new_time = timegm(&tm);
+ #else
+       new_time = mytimegm(&tm);
+ #endif
+       current_time = new_time;
+       }
+ 
+     // If we can't convert it, current_time stays the same and we get
+     // the default--the date returned by the server...
+ }
  
  //*****************************************************************************
  // void Retriever::got_anchor(char *anchor)
Index: htdig/Retriever.h
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Retriever.h,v
retrieving revision 1.9
diff -c -3 -p -r1.9 Retriever.h
*** htdig/Retriever.h 1999/03/16 02:04:29 1.9
--- htdig/Retriever.h 1999/04/02 01:37:20
*************** public:
*** 50,55 ****
--- 50,56 ----
      void        got_word(char *word, int location, int heading);
      void        got_href(URL &url, char *description);
      void        got_title(char *title);
+     void        got_time(char *time);
      void        got_head(char *head);
      void        got_meta_dsc(char *md);
      void        got_anchor(char *anchor);
*************** private:
*** 75,80 ****
--- 76,82 ----
      String      current_title;
      String      current_head;
      String      current_meta_dsc;
+     time_t      current_time;
      int         current_id;
      DocumentRef *current_ref;
      int         current_anchor_number;
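
The selection logic in the Retriever.cc hunk can be read in isolation as the small sketch below. The free function choose_doc_time is hypothetical; in the patch the same test sits inline in Retriever::RetrievedDocument and uses the current_time member.

```cpp
#include <ctime>

// Hypothetical stand-alone version of the patch's choice: prefer the date
// parsed from the meta tag, and fall back to the server's modification
// time when no meta date was seen (signalled by 0).
std::time_t choose_doc_time(std::time_t meta_time, std::time_t server_mod_time)
{
    return meta_time == 0 ? server_mod_time : meta_time;
}
```

Using 0 as the "not set" marker means a document whose meta date is 1970-01-01 would be treated as having no date and would fall back to the server time.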

------------------------------------ To unsubscribe from the htdig3-dev mailing list, send a message to htdig3-dev@htdig.org containing the single word "unsubscribe" in the SUBJECT of the message.



This archive was generated by hypermail 2.0b3 on Tue Aug 03 1999 - 21:22:24 PDT