Subject: RE: [htdig3-dev] Perl interface to database
From: Ben Parker (ben@ocha.unon.org)
Date: Fri Feb 18 2000 - 02:56:16 PST


I have been doing some work to get whatsnew.pl working again.

Warren's HtDig::Database module, posted to this list, thankfully does
almost all of the heavy lifting, and apart from a glitch related to
common_url_parts (files ending in .htm come out as .gif), the thing is
working. I know I am not the only one who was hoping to revive the facility,
so I'd like to give something back to the htdiggers.
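
To pin down the glitch, here is a minimal sketch (the two part lists
below are hypothetical, purely for illustration): if the
common_url_parts list in effect when the database was built disagrees
in order with the list the module decodes with, one part's code byte
decodes as a different part:

    use HtDig::Database;

    # Hypothetical list in effect at indexing time: ".htm" -> chr(7).
    HtDig::Database::set_url_parts( qw( .htm .html ) );
    my $enc = HtDig::Database::encode_url( 'http://example.com/page.htm' );

    # Hypothetical list at decode time: chr(7) now stands for ".gif".
    HtDig::Database::set_url_parts( qw( .gif .html ) );
    print HtDig::Database::decode_url( $enc ), "\n";
    # prints: http://example.com/page.gif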

I am wondering whether I should assume that HtDig::Database is going to be
an established module and simply "use" it (hopefully new versions and bug
fixes will come along, including an idiot-proof make install alongside
James' modules), or whether I should paste Warren's vital DB-decoding
routines straight into the new whatsnew.pl until the module namespace and
the module itself have settled down.
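
One middle path (a sketch, assuming the HtDig::Database name sticks)
would be to require the module at runtime and fall back to pasted
copies of the routines, the same trick Warren's module already plays
with Compress::Zlib:

    # Prefer the installed module; fall back to routines pasted
    # locally into whatsnew.pl (the fallback branch is hypothetical).
    if ( eval { require HtDig::Database } ) {
        import HtDig::Database qw(:all);
    }
    else {
        # ...pasted get_config/parse_docdb definitions go here...
    }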

Cheers,

Ben

-----Original Message-----
From: Tillman, James [mailto:JamesTillman@fdle.state.fl.us]
Sent: Wednesday, January 19, 2000 7:20 PM
To: 'Warren Jones'; htdig3-dev@htdig.org
Subject: RE: [htdig3-dev] Perl interface to database

I'm currently working in the HtDig Perl namespace, and here are the modules
I'm working on:

HtDig::Search
HtDig::Config
HtDig::Site

The Search module will be an XS link to the htsearch functionality of htdig.
If you have any interest in helping me with this one, let me know. The
other two are open source work I'm doing on "contract" with CoSource.com,
and are part of an administrative interface to the htdig configuration
files.

Jamie

> -----Original Message-----
> From: Warren Jones [mailto:wjones@tc.fluke.com]
> Sent: Tuesday, January 18, 2000 7:57 PM
> To: htdig3-dev@htdig.org
> Subject: [htdig3-dev] Perl interface to database
>
>
> I was disappointed when the Perl scripts in the contrib
> directory were broken by changes in the database format.
> Here's a first cut at an HtDig::Database module that attempts
> to fill the gap, along with a couple of short scripts that
> demonstrate how it can be used.
>
> The way the module deals with URL encoding is still pretty
> simple-minded -- I haven't attempted to duplicate all the
> logic in HtWordCodec.cc, but it should work for some of the
> most common cases. It may be that a better long term
> solution would be to interface to the code in HtWordCodec.cc
> as an XSUB.
>
> I understand that some of the HtDig module namespace has
> already been staked out. Comments on the naming of this
> module (or anything else) would be welcome.
>
> --
> Warren Jones
>
> ------------------------------ snip snip ------------------------------
> #! /bin/sh
> # This is a shell archive. Remove anything before this line, then unpack
> # it by saving it into a file and typing "sh file". To overwrite existing
> # files, type "sh file -c". You can also feed this as standard input via
> # unshar, or by typing "sh <file", e.g.. If this archive is complete, you
> # will see the following message at the end:
> # "End of shell archive."
> #
> # Contents: Database.pm listconfig listdocdb
> #
> # Wrapped by wjones@addy on Tue Jan 18 16:26:45 2000
> PATH=/bin:/usr/bin:/usr/ucb ; export PATH
> if test -f 'Database.pm' -a "${1}" != "-c" ; then
> echo shar: Will not clobber existing file \"'Database.pm'\"
> else
> echo shar: Extracting \"'Database.pm'\" \(9693 characters\)
> sed "s/^X//" >'Database.pm' <<'END_OF_FILE'
> Xpackage HtDig::Database;
> X
> X=head1 NAME
> X
> XHtDig::Database - Perl interface to ht://Dig docdb and config files
> X
> X=head1 SYNOPSIS
> X
> X use HtDig::Database;
> X
> X my $config = HtDig::Database::get_config( $config_file )
> X or die "$0: Can't access $config_file\n";
> X my %record = HtDig::Database::parse_docdb( $docdb_record );
> X print "URL = $record{URL}\n";
> X
> X=head1 DESCRIPTION
> X
> X=head2 Exported functions
> X
> XThe following functions are provided by HtDig::Database:
> X
> X get_config
> X parse_docdb
> X encode_url
> X decode_url
> X
> XBy default, functions are not exported into the caller's namespace,
> Xand you must invoke them using the full package name, e.g.:
> X
> X HtDig::Database::get_config( $config_file );
> X
> XTo import all available function names, invoke the module with:
> X
> X use HtDig::Database qw(:all);
> X
> X=head2 Parsing a config file
> X
> XC<get_config> parses a config file and returns a hash ref that
> Xcontains the configuration attributes. For example:
> X
> X my $config = HtDig::Database::get_config( $config_file )
> X or die "$0: Can't access $config_file\n";
> X print "start_url = $config->{start_url}\n";
> X
> XAll values in the hash are scalars, and any items that are intended
> Xto be lists or booleans must be parsed by the calling program.
> XC<get_config> returns C<undef> if the config file can't be opened,
> Xand carps about various syntax errors.
> X
> X=head2 Parsing a record from the document database
> X
> XC<parse_docdb> parses a record from the document database
> Xand returns a hash. For example:
> X
> X my %docdb;
> X tie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) ||
> X die "$0: Unable to open $docdb: $!";
> X
> X while ( my ( $key, $value ) = each %docdb ) {
> X next if $key =~ /^nextDocID/;
> X my %rec = HtDig::Database::parse_docdb( $value );
> X print " URL: $rec{URL}\n";
> X print "HOPCOUNT: $rec{HOPCOUNT}\n";
> X }
> X
> XURL's in the database are encoded using two attributes from the
> Xconfiguration file: I<common_url_parts> and I<url_part_aliases>.
> XC<parse_docdb> does only rudimentary decoding. It can't
> Xhandle more than 25 elements in the I<common_url_parts> list,
> Xand it currently can't handle I<url_part_aliases> at all.
> X
> XC<get_config> caches the value of I<common_url_parts> that's
> Xused for decoding URL's, and should usually be called before
> XC<parse_docdb>.
> X
> XCompressed data in the HEAD element will be automatically decompressed
> Xif the Compress::Zlib module is available. If Compress::Zlib is not
> Xinstalled, compressed data will be silently replaced by the string:
> X
> X "Compressed data: Zlib not available"
> X
> XIf only a single value is needed from the database record,
> Xit can be specified as a second parameter to C<parse_docdb>,
> Xwhich then returns the requested value as a scalar. For example:
> X
> X my $url = HtDig::Database::parse_docdb( $value, 'URL' );
> X
> X=head2 Encoding a URL
> X
> X my $encoded_url = HtDig::Database::encode_url( $url );
> X
> XThis may be useful for computing database keys, since the keys
> Xare encoded URL's. C<get_config> should be called before C<encode_url>
> Xor C<decode_url> to initialize the value of C<common_url_parts>.
> X
> X=head2 Decoding a URL
> X
> X my $url = HtDig::Database::decode_url( $encoded_url );
> X
> XThis should seldom be necessary, since URL's are normally
> Xdecoded by C<parse_docdb>.
> X
> X=head1 AUTHOR
> X
> XWarren Jones E<lt>F<wjones@halcyon.com>E<gt>
> X
> X=head1 BUGS
> X
> XOnly simple cases of URL encoding are handled correctly.
> XNo more than 25 elements are allowed in I<common_url_parts>.
> XThe value of I<url_part_aliases> is not used at all.
> XSomeday this module may implement the same URL encoding
> Xlogic found in F<HtWordCodec.cc>, but a better solution might
> Xbe to provide an XSUB interface to the C++ functions.
> X
> X=cut
> X
> X# $Id: Database.pm,v 1.1 2000/01/19 00:17:02 wjones Exp $
> X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/Database.pm,v $
> X
> Xrequire 5.000;
> Xuse Carp;
> Xuse Exporter;
> Xuse strict;
> Xuse vars qw( $VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS );
> X
> X# Load Compress::Zlib if possible, but it's not
> X# an error if compression is not available.
> X
> Xif ( eval { require Compress::Zlib } ) {
> X import Compress::Zlib;
> X}
> X
> X# Module version and exports:
> X
> X$VERSION = 0.50;
> X@ISA = qw( Exporter );
> X@EXPORT = ();
> X@EXPORT_OK = qw( get_config parse_docdb encode_url decode_url );
> X%EXPORT_TAGS = ( 'all' => [ @EXPORT_OK ] );
> X
> X# These strings are used as hash keys, and correspond to
> X# integer field codes in the docdb data structure.
> X# The order is important:
> X
> Xmy @fields = qw(
> X ID TIME ACCESSED STATE SIZE LINKS IMAGESIZE HOPCOUNT
> X URL HEAD TITLE DESCRIPTIONS ANCHORS EMAIL NOTIFICATION
> X SUBJECT STRING METADSC BACKLINKS SIG
> X);
> X
> X# These are the string fields and the list fields.
> X# All remaining fields are integers.
> X
> Xmy %string_fields = map { $_, 1 } qw(
> X URL HEAD TITLE EMAIL NOTIFICATION SUBJECT STRING METADSC
> X);
> X
> Xmy %list_fields = map { $_, 1 } qw( DESCRIPTIONS ANCHORS );
> X
> X
> X# These variables are used by &encode_url and &decode_url
> X# (see HtWordCodec.cc):
> X
> Xuse constant FIRST_INTERNAL_SINGLECHAR => 7;
> Xuse constant LAST_INTERNAL_SINGLECHAR => 31;
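> X# Codes chr(7)..chr(31) leave room for 25 common_url_parts entries.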
> X
> Xmy ( @url_parts, %url_parts, $url_parts );
> Xmy @default_url_parts = qw(
> X http:// http://www. ftp:// ftp://ftp. /pub/
> X .html .gif .jpg .jpeg /index.html /index.htm
> X .com/ .com mailto:
> X);
> Xmy $matchchars = sprintf '[\0%o-\0%o]', FIRST_INTERNAL_SINGLECHAR,
> X LAST_INTERNAL_SINGLECHAR;
> Xmy $maxparts = LAST_INTERNAL_SINGLECHAR -
> X FIRST_INTERNAL_SINGLECHAR + 1;
> Xmy $warning = '';
> X
> Xsub set_url_parts { # Setup for variables used by
> X # &encode_url and &decode_url
> X my $code = FIRST_INTERNAL_SINGLECHAR;
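> X # Save the list in @url_parts and map each part to its code char;
> X # $url_parts becomes an alternation regex of the quoted parts.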
> X %url_parts = map { $_, chr($code++) } @url_parts = @_;
> X $url_parts = join '|', map { quotemeta($_) } @_;
> X $warning = "Too many common_url_parts: can't handle more than $maxparts.\n"
> X if @url_parts > $maxparts;
> X}
> X
> Xset_url_parts( @default_url_parts ); # Initialize with defaults.
> X
> Xsub getnum { # Extract integer from doc record.
> X my ( $flags, $in ) = @_;
> X my ( $fmt, $length ) = ( 'I', 4 );
> X ( $fmt, $length ) = ( 'C', 1 ) if $flags & 0100;
> X ( $fmt, $length ) = ( 'S', 2 ) if $flags & 0200;
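> X # Flag bit 0100 marks a one-byte value, 0200 a two-byte value;
> X # otherwise it's a four-byte int. $_[1] is reassigned so the
> X # caller's buffer is advanced past the code byte and the value.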
> X $_[1] = substr($in,$length+1);
> X unpack($fmt,substr($in,1));
> X}
> X
> Xsub getstring { # Extract string from doc record.
> X my $length = getnum( @_ );
> X my $string = substr( $_[1], 0, $length );
> X $_[1] = substr( $_[1], $length );
> X return $string;
> X}
> X
> Xsub getlist { # Extract list from doc record.
> X my ( $flags, $in ) = @_;
> X my $count = getnum( $flags, $in );
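> X # With $flags set, each element has a one-byte length, and a
> X # length byte above 253 means a four-byte length follows; with
> X # no flags the length is always read as a four-byte int.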
> X my @list = ();
> X for ( my $i=0; $i<$count; $i++ ) {
> X my $length = 255;
> X if ( $flags ) {
> X $length = unpack('C',$in);
> X $in = substr($in,1);
> X }
> X if ( $length > 253 ) {
> X $length = unpack('I',$in);
> X $in = substr($in,4);
> X }
> X push @list, substr($in,0,$length);
> X $in = substr($in,$length);
> X }
> X $_[1] = $in;
> X return \@list;
> X}
> X
> Xsub parse_docdb
> X{
> X my $record = shift;
> X my %record = ();
> X while ( length($record) > 0 ) {
> X my $code = unpack('C', $record);
> X my $flags = $code & 0300;
> X $code &= 077;
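> X # High two bits (0300) are size flags; low six (077) index @fields.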
> X if ( $code > $#fields ) {
> X carp "Invalid item code: $code";
> X last;
> X }
> X my $field = $fields[$code];
> X my $value;
> X if ( $list_fields{$field} ) {
> X $value = getlist($flags,$record);
> X } elsif ( $string_fields{$field} ) {
> X $value = getstring($flags,$record);
> X $value = decode_url($value) if $field eq 'URL';
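> X # "x\234" is 0x78 0x9C, the usual zlib stream header.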
> X if ( $field eq 'HEAD' && substr($value,0,2) eq "x\234" ) {
> X if ( defined &inflateInit ) {
> X my ( $i, $zstatus ) = inflateInit();
> X ( $value, $zstatus ) = $i->inflate($value);
> X }
> X else {
> X $value = 'Compressed data: Zlib not available';
> X }
> X }
> X
> X } else {
> X $value = getnum($flags,$record);
> X }
> X return $value if $_[0] && $_[0] eq $field;
> X $record{$field} = $value;
> X }
> X return $_[0] ? '' : %record;
> X}
> X
> Xsub decode_url {
> X local($_) = shift;
> X if ( $warning ) {
> X carp $warning;
> X }
> X else {
> X s/$matchchars/$url_parts[ord($&)-&FIRST_INTERNAL_SINGLECHAR]/oeg;
> X }
> X $_;
> X}
> X
> Xsub encode_url {
> X local($_) = shift;
> X if ( $warning ) {
> X carp $warning;
> X }
> X else {
> X s/($url_parts)/$url_parts{$&}/eg;
> X }
> X $_;
> X}
> X
> Xsub get_config {
> X#
> X# The first argument is the name of an htdig config file.
> X# The second parameter, if present, is a hash ref that is
> X# to receive the config values. The second parameter is
> X# used only for recursive calls in the case of included files.
> X# A hash ref is returned on success, or undef if the
> X# file cannot be opened.
> X#
> X my $file = shift;
> X my $config = shift || {};
> X no strict 'refs';
> X if ( ! open $file, $file ) {
> X return undef;
> X }
> X while ( <$file> ) {
> X $_ .= <$file> while s/\\\n/ / && ! eof($file);
> X next if /^\s*(#|$)/;
> X ( my $key, $_ ) = /^\s*(\w+)\s*:\s*(.*)/;
> X if ( ! $key ) {
> X carp "Syntax error in $file (line $.)";
> X next;
> X }
> X s/\${(\w+)}/$config->{$1} || ''/ge; # variable substitution
> X s/`([^`]+)`/file_contents($1,$file)/ge; # file substitution
> X if ( $key eq 'include' ) {
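> X # Make a relative include path relative to this config file's directory.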
> X $_ = "$1/$_" if ! m|^/| && $file =~ m|(.*)/|;
> X get_config( $_, $config );
> X }
> X else {
> X $config->{$key} = $_;
> X }
> X }
> X close $file;
> X use strict 'refs';
> X my $parts = $config->{common_url_parts};
> X set_url_parts( defined($parts) ? split( ' ', $parts ) :
> X @default_url_parts );
> X $warning .= "URL translation can't handle url_part_aliases.\n"
> X if $config->{url_part_aliases};
> X return $config;
> X}
> X
> Xsub file_contents {
> X#
> X# Return the contents of a file, with newlines
> X# replaced by a single space.
> X#
> X if ( ! open FILE, $_[0] ) {
> X carp "Can't access included file $_[0] at $_[1] line $.";
> X return '';
> X }
> X my @file_contents = map { chomp; $_ } <FILE>;
> X close FILE;
> X return "@file_contents";
> X}
> X
> X1;
> END_OF_FILE
> if test 9693 -ne `wc -c <'Database.pm'`; then
> echo shar: \"'Database.pm'\" unpacked with wrong size!
> fi
> # end of 'Database.pm'
> fi
> if test -f 'listconfig' -a "${1}" != "-c" ; then
> echo shar: Will not clobber existing file \"'listconfig'\"
> else
> echo shar: Extracting \"'listconfig'\" \(631 characters\)
> sed "s/^X//" >'listconfig' <<'END_OF_FILE'
> X#! /usr/local/bin/perl -w
> X#
> X# listconfig -- list the contents of an ht://Dig configuration file
> X#
> X# Usage: listconfig config_file
> X#
> X# This is primarily intended as a test and demo of the get_config
> X# function in the HtDig::Database module.
> X#
> X# $Id: listconfig,v 1.1 2000/01/19 00:17:02 wjones Exp $
> X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/listconfig,v $
> X
> Xuse HtDig::Database qw(:all);
> Xuse strict;
> X
> Xmy $config_file = shift
> X or die "usage: $0 config_file\n";
> X
> Xmy $config = get_config( $config_file )
> X or die "$0: can't access $config_file\n";
> X
> Xfor ( sort keys %$config ) {
> X print "$_:\t$config->{$_}\n";
> X}
> X
> END_OF_FILE
> if test 631 -ne `wc -c <'listconfig'`; then
> echo shar: \"'listconfig'\" unpacked with wrong size!
> fi
> chmod +x 'listconfig'
> # end of 'listconfig'
> fi
> if test -f 'listdocdb' -a "${1}" != "-c" ; then
> echo shar: Will not clobber existing file \"'listdocdb'\"
> else
> echo shar: Extracting \"'listdocdb'\" \(2207 characters\)
> sed "s/^X//" >'listdocdb' <<'END_OF_FILE'
> X#! /usr/local/bin/perl -w
> X#
> X# listdocdb -- List ht://Dig document database
> X#
> X# Usage: listdocdb [-v [-v]] config_file [ docdb_file ]
> X#
> X# By default, only the URL's are listed from the database.
> X# With a single "-v" flag, other fields are also listed,
> X# including the first 60 characters of the HEAD excerpt.
> X# With two "-v" flags, the full text of the HEAD excerpt
> X# is shown.
> X#
> X# If the docdb file isn't specified on the command line,
> X# the path will be found in the config file, or if it
> X# can't be found there, a guess will be made based on
> X# the path to the config file.
> X#
> X# $Id: listdocdb,v 1.1 2000/01/19 00:17:02 wjones Exp $
> X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/listdocdb,v $
> X
> Xuse DB_File;
> Xuse HtDig::Database qw(:all);
> Xuse strict;
> X
> Xmy $verbose = 0;
> X
> Xwhile ( @ARGV && $ARGV[0] eq '-v' ) {
> X $verbose++;
> X shift;
> X}
> X
> Xmy ( $config_file, $docdb ) = @ARGV;
> X
> Xdie "Usage: $0 [-v] config_file [docdb_file]\n" if ! $config_file;
> X
> Xmy $config = get_config( $config_file ) or
> X die "$0: Can't access $config_file\n";
> X
> Xif ( ! $docdb ) {
> X
> X # If database file isn't specified on the command line,
> X # get it from the config file, or guess based on path
> X # to config file.
> X
> X my $database_base = $config->{database_base};
> X if ( ! $database_base ) {
> X my $database_dir = $config->{database_dir};
> X if ( ! $database_dir ) {
> X my ( $config_dir ) = ( $config_file =~ m|(.*)/| );
> X $database_dir = $config_dir ? "$config_dir/../db" : "../db";
> X }
> X $database_base = "$database_dir/db";
> X }
> X $docdb = "$database_base.docdb";
> X}
> X
> Xmy %docdb;
> Xtie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) ||
> X die "$0: Unable to open $docdb: $!";
> X
> Xwhile ( my ( $key, $value ) = each %docdb ) {
> X next if $key =~ /^nextDocID/;
> X if ( ! $verbose ) {
> X print decode_url( $key ), "\n";
> X }
> X else {
> X my %rec = parse_docdb( $value );
> X for ( sort keys %rec ) {
> X my $field = $rec{$_};
> X $field = join( "\n\t\t", @$field ) if ref($field) eq 'ARRAY';
> X $field = localtime( $field ) if /^(TIME|ACCESSED)$/;
> X $field = substr( $field, 0, 60 ) if /^HEAD$/ && $verbose < 2;
> X printf "%13s: %s\n", $_, $field;
> X }
> X print '='x60, "\n";
> X }
> X}
> X
> END_OF_FILE
> if test 2207 -ne `wc -c <'listdocdb'`; then
> echo shar: \"'listdocdb'\" unpacked with wrong size!
> fi
> chmod +x 'listdocdb'
> # end of 'listdocdb'
> fi
> echo shar: End of shell archive.
> exit 0
>

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev-unsubscribe@htdig.org
You will receive a message to confirm this.


