[htdig3-dev] Perl interface to database


Subject: [htdig3-dev] Perl interface to database
From: Warren Jones (wjones@tc.fluke.com)
Date: Tue Jan 18 2000 - 16:56:30 PST


I was disappointed when the Perl scripts in the contrib
directory were broken by changes in the database format.
Here's a first cut at a HtDig::Database module that attempts
to fill the gap, along with a couple of short scripts that
demonstrate how it can be used.

The way the module deals with URL encoding is still pretty
simple minded -- I haven't attempted to duplicate all the
logic in HtWordCodec.cc, but it should work for some of the
most common cases. It may be that a better long term
solution would be to interface to the code in HtWordCodec.cc
as an XSUB.

I understand that some of the HtDig module namespace has
already been staked out. Comments on the naming of this
module (or anything else) would be welcome.

--
Warren Jones

------------------------------ snip snip ------------------------------ #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh <file", e.g.. If this archive is complete, you # will see the following message at the end: # "End of shell archive." # # Contents: Database.pm listconfig listdocdb # # Wrapped by wjones@addy on Tue Jan 18 16:26:45 2000 PATH=/bin:/usr/bin:/usr/ucb ; export PATH if test -f 'Database.pm' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'Database.pm'\" else echo shar: Extracting \"'Database.pm'\" \(9693 characters\) sed "s/^X//" >'Database.pm' <<'END_OF_FILE' Xpackage HtDig::Database; X X=head1 NAME X XHtdig::Database - Perl interface Ht://Dig docdb and config files X X=head1 SYNOPSIS X X use Htdig::Database; X X my $config = Htdig::Database::get_config( $config_file ) X or die "$0: Can't access $config_file\n"; X my $record = Htdig::Database::parse_docdb( $docdb_record ); X print "URL = $record->{URL}\n"; X X=head1 DESCRIPTION X X=head2 Exported functions X XThe following functions are provided by Htdig::Database: X X get_config X parse_docdb X encode_url X decode_url X XBy default, functions are not exported into the callers namespace, Xand you must invoke them using the full package name, e.g.: X X Htdig::Database::getconfig( $config_file ); X XTo import all available function names, invoke the module with: X X use Htdig::Database qw(:all); X X=head2 Parsing a config file X XC<get_config> parses a config file and returns a hash ref that Xcontains the configuration attributes. For example: X X my $config = Htdig::Database::get_config( $config_file ) X or die "$0: Can't access $config_file\n"; X print "start_url = $config->{start_url}\n"; X XAll values in the hash are scalars, and any items that are intended Xto be lists or booleans must be parsed by the calling program. XC<get_config> returns C<undef> if the config file can't be opened, Xand carps about various syntax errors. X X=head2 Parsing a record from the document database X XC<parse_docdb> parses a record from the document database Xand returns a hash ref. For example: X X my %docdb; X tie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) || X die "$0: Unable to open $docdb: $!"; X X while ( my ( $key, $value ) = each %docdb ) { X next if $key =~ /^nextDocID/; X my %rec = Htdig::Database:parse_docdb( $value ); X print " URL: $record->{URL}\n"; X print "HOPCOUNT: $record->{HOPCOUNT}\n"; X } X XURL's in the database are encoded using two attributes from the Xconfiguration file: I<common_url_parts> and I<url_part_aliases>. XC<parse_docdb> does only rudimentary decoding. It can't Xhandle more than 25 elements in the I<common_url_parts> list, Xand it currently can't handle I<url_part_aliases> at all. X XC<get_config> caches the value of I<common_url_parts> that's Xused for decoding URL's, and should usually be called before XC<parse_docdb>. X XCompressed data in the HEAD element will be automatically decompressed Xif the Compress::Zlib module is available. If Compress::Zlib is not Xinstalled, compressed data will be silently replaced by the string: X X "Compressed data: Zlib not available" X XIf only a single value is needed from the database record, Xit can be specified as a second parameter to C<parse_docdb>, Xwhich then returns the requested value as a scalar. For example: X X my $url = Htdig::Database:parse_docdb( $value, 'URL' ); X X=head2 Encoding a URL X X my $encoded_url = Htdig::Database::encode_url( $url ); X XThis may be useful for computing database keys, since the keys Xare encoded URL's. C<get_config> should be called before C<encode_url> Xor C<decode_url> to initialize the value of C<common_url_parts>. X X=head2 Decoding a URL X X my $url = Htdig::Database::decode_url( $encoded_url ); X XThis should seldom be necessary, since URL's are normally Xdecoded by C<parse_docdb>. X X=head1 AUTHOR X XWarren Jones E<lt>F<wjones@halcyon.com>E<gt> X X=head1 BUGS X XOnly simple cases of URL encoding are handled correctly. XNo more than 25 elements are allowed in I<common_url_parts>. XThe value of I<url_part_aliases> is not used at all. XSomeday this module may implement the same URL encoding Xlogic found in F<HtWordCodec.cc>, but a better solution might Xbe to provide an XSUB interface to the C++ functions. X X=cut X X# $Id: Database.pm,v 1.1 2000/01/19 00:17:02 wjones Exp $ X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/Database.pm,v $ X Xrequire 5.000; Xuse Carp; Xuse Exporter; Xuse strict; Xuse vars qw( $VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS ); X X# Load Compress::Zlib if possible, but it's not X# an error if compression is not available. X Xif ( eval { require Compress::Zlib } ) { X import Compress::Zlib; X} X X# Constants used in URL encoding (see HtWordCodec.cc): X X$VERSION = 0.50; X@ISA = qw( Exporter ); X@EXPORT = (); X@EXPORT_OK = qw( get_config parse_docdb encode_url decode_url ); X%EXPORT_TAGS = ( 'all' => [ @EXPORT_OK ] ); X X# These strings are used as hash keys, and correspond to X# integer field codes in the docdb data structure. X# The order is important: X Xmy @fields = qw( X ID TIME ACCESSED STATE SIZE LINKS IMAGESIZE HOPCOUNT X URL HEAD TITLE DESCRIPTIONS ANCHORS EMAIL NOTIFICATION X SUBJECT STRING METADSC BACKLINKS SIG X); X X# These are the string fields and the list fields. X# All remaining fields are integers. X Xmy %string_fields = map { $_, 1 } qw( X URL HEAD TITLE EMAIL NOTIFICATION SUBJECT STRING METADSC X); X Xmy %list_fields = map { $_, 1 } qw( DESCRIPTIONS ANCHORS ); X X X# These variables are used by &encode_url and &decode_url: X Xuse constant FIRST_INTERNAL_SINGLECHAR => 7; Xuse constant LAST_INTERNAL_SINGLECHAR => 31; X Xmy ( @url_parts, %url_parts, $url_parts ); Xmy @default_url_parts = qw( X http:// http://www. ftp:// ftp://ftp. /pub/ X .html .gif .jpg .jpeg /index.html /index.htm X .com/ .com mailto: X); Xmy $matchchars = sprintf '[\0%o-\0%o]', FIRST_INTERNAL_SINGLECHAR, X LAST_INTERNAL_SINGLECHAR; Xmy $maxparts = LAST_INTERNAL_SINGLECHAR - X FIRST_INTERNAL_SINGLECHAR + 1; Xmy $warning = ''; X Xsub set_url_parts { # Setup for variables used by X # &encode_url and &decode_url X my $code = FIRST_INTERNAL_SINGLECHAR; X %url_parts = map { $_, chr($code++) } @url_parts = @_; X $url_parts = join '|', map { quotemeta($_) } @_; X $warning = "Too many common_url_parts: can't handle more than $maxparts.\n" X if $#url_parts > $maxparts; X} X Xset_url_parts( @default_url_parts ); # Initialize with defaults. X Xsub getnum { # Extract integer from doc doc record. X my ( $flags, $in ) = @_; X my ( $fmt, $length ) = ( 'I', 4 ); X ( $fmt, $length ) = ( 'C', 1 ) if $flags & 0100; X ( $fmt, $length ) = ( 'S', 2 ) if $flags & 0200; X $_[1] = substr($in,$length+1); X unpack($fmt,substr($in,1)); X} X Xsub getstring { # Extract string from doc record. X my $length = getnum( @_ ); X my $string = substr( $_[1], 0, $length ); X $_[1] = substr( $_[1], $length ); X return $string; X} X Xsub getlist { # Extract list from doc record. X my ( $flags, $in ) = @_; X my $count = getnum( $flags, $in ); X my @list = (); X for ( my $i=0; $i<$count; $i++ ) { X my $length = 255; X if ( $flags ) { X $length = unpack('C',$in); X $in = substr($in,1); X } X if ( $length > 253 ) { X $length = unpack('I',$in); X $in = substr($in,4); X } X push @list, substr($in,0,$length); X $in = substr($in,$length); X } X $_[1] = $in; X return \@list; X} X Xsub parse_docdb X{ X my $record = shift; X my %record = (); X while ( length($record) > 0 ) { X my $code = unpack('C', $record); X my $flags = $code & 0300; X $code &= 077; X if ( $code > $#fields ) { X carp "Invalid item code: $code"; X last; X } X my $field = $fields[$code]; X my $value; X if ( $list_fields{$field} ) { X $value = getlist($flags,$record); X } elsif ( $string_fields{$field} ) { X $value = getstring($flags,$record); X $value = decode_url($value) if $field eq 'URL'; X if ( $field eq 'HEAD' && substr($value,0,2) eq "x\234" ) { X if ( defined &inflateInit ) { X my ( $i, $zstatus ) = inflateInit(); X ( $value, $zstatus ) = $i->inflate($value); X } X else { X $value = 'Compressed data: Zlib not available'; X } X } X X } else { X $value = getnum($flags,$record); X } X return $value if $_[0] && $_[0] eq $field; X $record{$field} = $value; X } X return $_[0] ? '' : %record; X} X Xsub decode_url { X local($_) = shift; X if ( $warning ) { X carp $warning; X } X else { X s/$matchchars/$url_parts[ord($&)-&FIRST_INTERNAL_SINGLECHAR]/oeg; X } X $_; X} X Xsub encode_url { X local($_) = shift; X if ( $warning ) { X carp $warning; X } X else { X s/($url_parts)/$url_parts{$&}/eg; X } X $_; X} X Xsub get_config { X# X# The first argument is the name of an htdig config file. X# The second parameter, if present, is a hash ref that is X# to receive the config values. The second parameter is X# used only for recursive calls in the case of included files. X# A hash ref is returned if on success, or undef if the X# file cannot be opened. X# X my $file = shift; X my $config = shift || {}; X no strict 'refs'; X if ( ! open $file, $file ) { X return undef; X } X while ( <$file> ) { X $_ .= <$file> while s/\\\n/ / && ! eof($file); X next if /^\s*(#|$)/; X ( my $key, $_ ) = /^\s*(\w+)\s*:\s*(.*)/; X if ( ! $key ) { X carp "Syntax error in $file (line $.)"; X next; X } X s/\${(\w+)}/$config->{$1} || ''/ge; # variable substitution X s/`([^`]+)`/file_contents($1,$file)/ge; # file substitution X if ( $key eq 'include' ) { X $_ = "$1/$_" if ! m|^/| && $file =~ m|(.*)/|; X get_config( $_, $config ); X } X else { X $config->{$key} = $_; X } X } X close $file; X use strict 'refs'; X my $parts = $config->{common_url_parts}; X set_url_parts( defined($parts) ? split( ' ', $parts ) : X @default_url_parts ); X $warning .= "URL translation can't handle url_part_aliases.\n" X if $config->{url_part_aliases}; X return $config; X} X Xsub file_contents { X# X# Return the contents of a file, with newlines X# replaced by a single space. X# X if ( ! open FILE, $_[0] ) { X carp "Can't access included file $_[0] at $_[1] line $."; X return ''; X } X my @file_contents = map { chomp, $_ } <FILE>; X close FILE; X return "@file_contents"; X} X X1; END_OF_FILE if test 9693 -ne `wc -c <'Database.pm'`; then echo shar: \"'Database.pm'\" unpacked with wrong size! fi # end of 'Database.pm' fi if test -f 'listconfig' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'listconfig'\" else echo shar: Extracting \"'listconfig'\" \(631 characters\) sed "s/^X//" >'listconfig' <<'END_OF_FILE' X#! /usr/local/bin/perl -w X# X# listconfig -- list the contents of an ht://Dig configuration file X# X# Usage: listconfig config_file X# X# This primarily intended as a test and demo of the get_config X# function in the HtDig::Database module. X# X# $Id: listconfig,v 1.1 2000/01/19 00:17:02 wjones Exp $ X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/listconfig,v $ X Xuse HtDig::Database qw(:all); Xuse strict; X Xmy $config_file = shift X or die "usage: $0 config_file\n"; X Xmy $config = get_config( $config_file ) X or die "$0: can't access $config_file\n"; X Xfor ( sort keys %$config ) { X print "$_:\t$config->{$_}\n"; X} X END_OF_FILE if test 631 -ne `wc -c <'listconfig'`; then echo shar: \"'listconfig'\" unpacked with wrong size! fi chmod +x 'listconfig' # end of 'listconfig' fi if test -f 'listdocdb' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'listdocdb'\" else echo shar: Extracting \"'listdocdb'\" \(2207 characters\) sed "s/^X//" >'listdocdb' <<'END_OF_FILE' X#! /usr/local/bin/perl -w X# X# listdocdb -- List ht://Dig document database X# X# Usage: listdocdb [-v [-v]] config_file [ docdb_file ] X# X# By default, only the URL's are listed from the database. X# With a single "-v" flag, other fields are also listed, X# including the first 60 characters of the HEAD excerpt. X# With two "-v" flags, the full text of the HEAD excerpt X# is shown. X# X# If the docdb file isn't specified on the command line, X# the path will be found in the config file, or if it X# can't be found there, a guess will be made based on X# the path to the config file. X# X# $Id: listdocdb,v 1.1 2000/01/19 00:17:02 wjones Exp $ X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/listdocdb,v $ X Xuse DB_File; Xuse HtDig::Database qw(:all); Xuse strict; X Xmy $verbose = 0; X Xwhile ( @ARGV && $ARGV[0] eq '-v' ) { X $verbose++; X shift; X} X Xmy ( $config_file, $docdb ) = @ARGV; X Xdie "Usage: $0 [-v] config_file [docdb_file]\n" if ! $config_file; X Xmy $config = get_config( $config_file ) or X die "$0: Can't access $config_file\n"; X Xif ( ! $docdb ) { X X # If database file isn't specified on the command line, X # get if from the config file, or guess based on path X # to config file. X X my $database_base = $config->{database_base}; X if ( ! $database_base ) { X my $database_dir = $config->{database_dir}; X if ( ! $database_dir ) { X my ( $config_dir ) = ( $config_file =~ m|(.*)/| ); X $database_dir = $config_dir ? "$config_dir/../db" : "../db"; X } X $database_base = "$database_dir/db"; X } X $docdb = "$database_base.docdb"; X} X Xmy %docdb; Xtie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) || X die "$0: Unable to open $docdb: $!"; X Xwhile ( my ( $key, $value ) = each %docdb ) { X next if $key =~ /^nextDocID/; X if ( ! $verbose ) { X print decode_url( $key ), "\n"; X } X else { X my %rec = parse_docdb( $value ); X for ( sort keys %rec ) { X my $field = $rec{$_}; X $field = join( "\n\t\t", @$field ) if ref($field) eq 'ARRAY'; X $field = localtime( $field ) if /^(TIME|ACCESSED)$/; X $field = substr( $field, 0, 60 ) if /^HEAD$/ && $verbose < 2; X printf "%13s: %s\n", $_, $field; X } X print '='x60, "\n"; X } X} X END_OF_FILE if test 2207 -ne `wc -c <'listdocdb'`; then echo shar: \"'listdocdb'\" unpacked with wrong size! fi chmod +x 'listdocdb' # end of 'listdocdb' fi echo shar: End of shell archive. exit 0

------------------------------------ To unsubscribe from the htdig3-dev mailing list, send a message to htdig3-dev-unsubscribe@htdig.org You will receive a message to confirm this.



This archive was generated by hypermail 2b28 : Tue Jan 18 2000 - 16:57:01 PST