NAME

Jdb - a flat-text database for shell scripting

SYNOPSIS

JDB is package of commands for manipulating flat-ASCII databases from shell scripts. JDB is useful to process medium amounts of data (with very little data you'd do it by hand, with megabytes you might want a real database).

JDB is very good at doing things like:

extracting measurements from experimental output
examining data to address different hypotheses
joining data from different experiments
eliminating/detecting outliers
computing statistics on data (mean, confidence intervals, correlations, histograms)
reformatting data for graphing programs

Rather than hand-code scripts to do each special case, JDB provides higher-level functions. Although it's often easy throw together a custom script to do any single task, I believe that there are several advantages to using this library:

these programs provide a higher level interface than plain Perl, so

Fewer lines of simpler code:

    dbrow '_size == 1024' | dbcolstats bw

rather than:

    while (<>) { split; $sum+=$F[2]; $ss+=$F[2]^2; $n++; }
    $mean = $sum / $n; $std_dev = ...

in dozens of places.

the library uses names for columns, so
- No more $F[2], use _size.
- New or different order columns? No changes to your scripts!
A string of actions are self-documenting (each program records what it does).
- No more wondering what hacks were used to compute the final data, just look at the comments at the end of the output.
The library is mature, supporting large datasets, corner cases, error handling, backed by an automated test suite.
- No more puzzling about bad output because your custom script skimped on error checking.
- No more memory thrashing when you try to sort ten million records.
Jdb-2.x supports Perl scripting (in addition to shell scripting), with libraries to do Jdb input and output, and easy support for pipelines. The shell script
```
    dbcol name test1 | dbroweval '_test1 += 5;'
```
can be written in perl as:
```
    dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
```

(The disadvantage is that you need to learn what functions JDB provides.)

JDB is built on flat-ASCII databases. By storing data in simple text files and processing it with pipelines it is easy to experiment (in the shell) and look at the output. To the best of my knowledge, the original implementation of this idea was /rdb, a commercial product described in the book UNIX relational database management: application development in the UNIX environment by Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web page http://www.rdb.com/). JDB is an incompatible re-implementation of their idea without any accelerated indexing or forms support. (But it's free, and probably has better statistics!).

JDB-2.x supports threading and will exploit multiple processors or cores, and provides Perl-level support for input, output, and threaded-pipelines.

Installation instructions follow at the end of this document. JDB-2.x requires Perl 5.8 to run. All commands have manual pages and provide usage with the --help option. All commands are backed by an automated test suite.

The most recent version of JDB is available on the web at http://www.isi.edu/~johnh/SOFTWARE/JDB/index.html.

WHAT'S NEW

2.11, 14-Oct-08

Still in beta, but picking up some bug fixes.

ENHANCEMENT: html_table_to_db is now more agressive about filling in empty cells with the official empty value, rather than leaving them blank or as whitespace.
ENHANCEMENT: dbpipeline now catches failures during pipeline element setup and exits reasonably gracefully.
BUG FIX: dbsubprocess now reaps child prcoesses, thus avoiding running out of processes when used a lot.

README CONTENTS

executive summary
what's new
README CONTENTS
installation
basic data format
basic data manipulation
list of commands
another example
a gradebook example
a password example
history
related work
release notes
copyright
comments

INSTALLATION

Jdb now uses the standard Perl build and installation from ExtUtil::MakeMaker(3), so the quick answer to installation is to type:


    perl Makefile.PL
    make
    make test
    make install

Or, if you want to install it somewhere else, change the first line to

    perl Makefile.PL PREFIX=$HOME

and it will go in your home directory's bin, etc. (See the ExtUtil::MakeMaker(3) manpage for more details.)

JDB requires perl 5.8 or later and uses ithreads.

A test-suite is available, run it with

    make test

A FreeBSD port to JDB is available, see http://www.freshports.org/databases/jdb/.

A Fink (MacOS X) port is available, see http://pdb.finkproject.org/pdb/package.php/jdb. (Thanks to Lars Eggert for maintaining this port.)

BASIC DATA FORMAT

These programs are based on the idea storing data in simple ASCII files. A database is a file with one header line and then data or comment lines. For example:

        #h account passwd uid gid fullname homedir shell
        johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
        greg * 2275 134 Greg_Johnson /home/greg /bin/bash
        root * 0 0 Root /root /bin/bash
        # this is a simple database

The header line must be first and begins with #h. There are rows (records) and columns (fields), just like in a normal database. Comment lines begin with #.

By default, columns are delimited by whitespace. With this default configuration, the contents of a field cannot contain whitespace. However, this limitation can be relaxed by changing the field separator as described below.

The big advantage of simple flat-text databases is that it is usually easy to massage data into this format, and it's reasonably easy to take data out of this format into other (text-based) programs, like gnuplot, jgraph, and LaTeX. Think Unix. Think pipes. (Or even output to Excel and HTML if you prefer.)

Since no-whitespace in columns was a problem for some applications, there's an option which relaxes this rule. You can specify the field separator in the table header with -Fx where x is the new field separator. The special value -FS sets a separator of two spaces, thus allowing (single) spaces in fields. An example:

        #h -FS account passwd uid gid fullname homedir shell
        johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
        greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
        root  *  0  0  Root  /root  /bin/bash
        # this is a simple database

See dbfilealter(1) for more details. Regardless of what the column separator is for the body of the data, it's always whitespace in the header.

There's also a third format: a "list". Because it's often hard to see what's columns past the first two, in list format each "column" is on a separate line. The programs dblistize and dbcolize convert to and from this format, and all programs work with either formats. The command

    dbfilealter -R C  < DATA/passwd.jdb

outputs:

        #L account passwd uid gid fullname homedir shell
        account:  johnh
        passwd:   *
        uid:      2274
        gid:      134
        fullname: John_Heidemann
        homedir:  /home/johnh
        shell:    /bin/bash
        
        account:  greg
        passwd:   *
        uid:      2275
        gid:      134
        fullname: Greg_Johnson
        homedir:  /home/greg
        shell:    /bin/bash
        
        account:  root
        passwd:   *
        uid:      0
        gid:      0
        fullname: Root
        homedir:  /root
        shell:    /bin/bash
        
        # this is a simple database
        #  | dblistize

See dbfilealter(1) for more details.

BASIC DATA MANIPULATION

A number of programs exist to manipulate databases. Complex functions can be made by stringing together commands with shell pipelines. For example, to print the home directories of everyone with ``john'' in their names, you would do:

        cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir

The output might be:

        #h      homedir
        /home/johnh
        /home/greg
        # this is a simple database
        #  | dbrow _fullname =~ /John/
        #  | dbcol homedir

(Notice that comments are appended to the output listing each command, providing an automatic audit log.)

In addition to typical database functions (select, join, etc.) there are also a number of statistical functions.

TALKING ABOUT COLUMNS

An advantage of JDB is that you can talk about columns by name (symbolically) rather than simply by their positions. So in the above example, dbcol homedir pulled out the home directory column, and dbrow '_fullname =~ /John/' matched against column fullname.

In general, you can use the name of the column listed on the #h line to identify it in most programs, and _name to identify it in code.

Some alternatives for flexibility:

Numeric values identify columns positionally, numbering from 0. So 0 or _0 is the first column, 1 is the second, etc.
In code, _last_columnname gets the value from columname's previous row.

See dbroweval(1) for more details about writing code.

LIST OF COMMANDS

Enough said. I'll summarize the commands, and then you can experiment. For a detailed description of each command, see a summary by running it with the argument --help (or -? if you prefer.) Full manual pages can be found by running the command with the argument --man, or running the Unix command man dbcol or whatever program you want.

TABLE CREATION

dbcolcreate: add columns to a database
dbcoldefine: set the column headings for a non-JDB file

TABLE MANIPULATION

dbcol: select columns from a table
dbrow: select rows from a table
dbsort: sort rows based on a set of columns
dbjoin: compute the natural join of two tables
dbcolrename: rename a column
dbcolmerge: merge two columns into one
dbcolsplittocols: split one column into two or more columns
dbcolsplittorows: split one column into multiple rows
dbfilevalidate: check that db file doesn't have some common errors

COMPUTATION AND STATISTICS

dbcolstats: compute statistics over a column (mean,etc.,optionally median)
dbmultistats: group rows by some key value, then compute stats (mean, etc.) over each group
dbmapreduce: group rows (map) and then apply an arbitrary function to each group (reduce)
dbrvstatdiff: compare two samples distributions (mean/conf interval/T-test)
dbcolmovingstats: computing moving statistics over a column of data
dbcolstatscores: compute Z-scores and T-scores over one column of data
dbcolpercentile: compute the rank or percentile of a column
dbcolhisto: compute histograms over a column of data
dbcolscorrelate: compute the coefficient of correlation over several columns
dbcolsregression: compute linear regression and correlation for two columns
dbrowaccumulate: compute a running sum over a column of data
dbrowcount: count the number of rows (a subset of dbstats)
dbrowdiff: compute differences between each row of a table
dbrowenumerate: number each row
dbroweval: run arbitrary Perl code on each row
dbrowuniq: count/eliminate identical rows (like Unix uniq(1))

OUTPUT CONTROL

dbcolneaten: pretty-print columns
dbfilealter: convert between column or list format, or change the column separator
dbfilestripcomments: remove comments from a table
dbformmail: generate a script that sends form mail based on each row

CONVERSIONS

(These programs convert data into jdb. See their web pages for details.)

cgi_to_db: http://stein.cshl.org/boulder/
combined_log_format_to_db: http://httpd.apache.org/docs/2.0/logs.html
html_table_to_db: HTML tables to jdb (assuming they're reasonably formatted).
kitrace_to_db: http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html
ns_to_db: http://mash-www.cs.berkeley.edu/ns/
tabdelim_to_db: spreadsheet tab-delimited files to db
tcpdump_to_db: (see man tcpdump(8) on any reasonable system)

(And out of jdb:)

db_to_csv: Comma-separated-value format from jdb.
db_to_html_table: simple conversion of JDB to html tables

STANDARD OPTIONS

Many programs have common options:

-? or --help: Show basic usage.
-c FRACTION or --confidence FRACTION: Specify confidence interval FRACTION (dbcolstats, dbmultistats, etc.)
-C S or --element-separator S: Specify column separator S (dbcolsplittocols, dbcolmerge).
-d or --debug: Enable debugging (may be repeated for greater effect in some cases).
-a or --include-non-numeric: Compute stats over all data (treating non-numbers as zeros). (By default, things that can't be treated as numbers are ignored for stats purposes)
-S or --pre-sorted: Assume the data is pre-sorted. May be repeated to disable verification (saving a small amount of work).
-e E or --empty E: give value E as the value for empty (null) records
-i I or --input I: Input data from file I.
-o O or --output O: Write data out to file O.
--nolog.: Skip logging the program in a trailing comment.

When giving Perl code (in dbrow and dbroweval) column names can be embedded if preceded by underscores. Look at dbrow(1) or dbroweval(1) for examples.)

Most programs run in constant memory and use temporary files if necessary. Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce, dbmultistats, dbrowsplituniq.

ANOTHER EXAMPLE

Take the raw data in DATA/http_bandwidth, put a header on it (dbcoldefine size bw), took statistics of each category (dbmultistats size bw), pick out the relevant fields (dbcol size mean stddev pct_rsd), and you get:

        #h      size    mean    stddev  pct_rsd
        1024    1.4962e+06      2.8497e+05      19.047
        10240   5.0286e+06      6.0103e+05      11.952
        102400  4.9216e+06      3.0939e+05      6.2863
        #  | dbcoldefine size bw
        #  | /home/johnh/BIN/DB/dbmultistats size bw
        #  | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd

(The whole command was:

        cat DATA/http_bandwidth |
        dbcoldefine size |
        dbmultistats size bw |
        dbcol size mean stddev pct_rsd

all on one line.)

Then post-process them to get rid of the exponential notation by adding this to the end of the pipeline:

    dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'

(Actually, this step is no longer required since dbcolstats now uses a different default format.)

giving:

        #h      size    mean    stddev  pct_rsd
        1024     1496200          284970        19.047
        10240    5028600          601030        11.952
        102400   4921600          309390        6.2863
        #  | dbcoldefine size bw
        #  | dbmultistats size bw
        #  | dbcol size mean stddev pct_rsd
        #  | dbroweval   { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }

In a few lines, raw data is transformed to processed output.

Suppose you expect there is an odd distribution of results of one datapoint. JDB can easily produce a CDF (cumulative distribution function) of the data, suitable for graphing:

    cat DB/DATA/http_bandwidth | \
        dbcoldefine size bw | \
        dbrow '_size == 102400' | \
        dbcol bw | \
        dbsort -n bw | \
        dbrowenumerate | \
        dbcolpercentile count | \
        dbcol bw percentile | \
        xgraph

The steps, roughly: 1. get the raw input data and turn it into jdb format, 2. pick out just the relevant column (for efficiency) and sort it, 3. for each data point, assign a CDF percentage to it, 4. pick out the two columns to graph and show them

A GRADEBOOK EXAMPLE

The first commercial program I wrote was a gradebook, so here's how to do it with JDB.

Format your data like DATA/grades.

        #h name email id test1
        a a@ucla.edu 1 80
        b b@usc.edu 2 70
        c c@isi.edu 3 65
        d d@lmu.edu 4 90
        e e@caltech.edu 5 70
        f f@oxy.edu 6 90

Or if your students have spaces in their names, use -FS and two spaces to separate each column:

        #h -FS name email id test1
        alfred aho  a@ucla.edu  1  80
        butler lampson  b@usc.edu  2  70
        david clark  c@isi.edu  3  65
        constantine drovolis  d@lmu.edu  4  90
        debrorah estrin  e@caltech.edu  5  70
        sally floyd  f@oxy.edu  6  90

To compute statistics on an exam, do

        cat DATA/grades | dbstats test1 |dblistize

giving

        #L  ...
        mean:        77.5
        stddev:      10.84
        pct_rsd:     13.987
        conf_range:  11.377
        conf_low:    66.123
        conf_high:   88.877
        conf_pct:    0.95
        sum:         465
        sum_squared: 36625
        min:         65
        max:         90
        n:           6
        ...

To do a histogram:

        cat DATA/grades | dbcolhisto -n 5 -g test1

giving

        #h low histogram
        65      *
        70      **
        75
        80      *
        85
        90      **
        #  | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1

Now you want to send out grades to the students by e-mail. Create a form-letter (in the file test1.txt):

        To: _email (_name)
        From: J. Random Professor <jrp@usc.edu>
        Subject: test1 scores

        _name, your score on test1 was _test1.
        86+   A
        75-85 B
        70-74 C
        0-69  F

Generate the shell script that will send the mail out:

        cat DATA/grades | dbformmail test1.txt > test1.sh

And run it:

        sh <test1.sh

The last two steps can be combined:

        cat DATA/grades | dbformmail test1.txt | sh

but I like to keep a copy of exactly what I send.

At the end of the semester you'll want to compute grade totals and assign letter grades. Both fall out of dbroweval. For example, to compute weighted total grades with a 40% midterm/60% final where the midterm is 84 possible points and the final 100:

        dbcol -rv total |
        dbcolcreate total - |
        dbroweval '
                _total = .40 * _midterm/84.0 + .60 * _final/100.0;
                _total = sprintf("%4.2f", _total);
                if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' | 
        dbcolneaten

If you got the data originally from a spreadsheet, save it in "tab-delimited" format and convert it with tabdelim_to_db (run tabdelim_to_db -? for examples).

A PASSWORD EXAMPLE

To convert the Unix password file to db:

        cat /etc/passwd | sed 's/:/  /g'| \
                dbcoldefine -F S login password uid gid gecos home shell \
                >passwd.jdb

To convert the group file

        cat /etc/group | sed 's/:/  /g' | \
                dbcoldefine -F S group password gid members \
                >group.jdb

To show the names of the groups that div7-members are in (assuming DIV7 is in the gecos field):

        cat passwd.jdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
                dbjoin - group.jdb gid | dbcol login group

SHORT EXAMPLES

Which db programs are the most complicated (based on number of test cases)?

        ls TEST/*.cmd | \
                dbcoldefine test | \
                dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
                dbrowuniq -c | \
                dbsort -nr count | \
                dbcolneaten

(Answer: dbcolstats, then dbjoin and dbfilealter, then dbsort.)

Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?

        cat $FILE | dbcolstats -q 4 $COLUMN <$FILE | dblistize | dbstripcomments

        cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbstripcomments

Merging a the hw1 column from file hw1.jdb into grades.jdb assuing there's a common student id in column "id":

        dbcol id hw1 <hw1.jdb >t.jdb

        dbjoin -a -e - grades.jdb t.jdb id | \
            dbsort  name | \
            dbcolneaten >new_grades.jdb

Merging two jdb files with the same rows:

        cat file1.jdb file2.jdb >output.jdb

or if you want to clean things up a bit

        cat file1.jdb file2.jdb | dbstripextraheaders >output.jdb

or if you want to know where the data came from

        for i in 1 2
        do
                dbcolcreate source $i < file$i.jdb
        done >output.jdb

(assumes you're using a Bourne-shell compatible shell, not csh).

HISTORY

There have been three versions of JDB; jdb 1.0 is a complete re-write of the pre-1995 versions, and was distributed from 1995 to 2007. Jdb 2.0 is a significant re-write of the 1.x versions for reasons described below.

JDB (in its various forms) has been used extensively by its author since 1991. Since 1995 it's been used by two other researchers at UCLA and several at ISI. In February 1998 it was announced to the Internet. Since then it has found a few users, some outside where I work.

Jdb 2.0 Rationale

I've thought about jdb-2.0 for many years, but it was started in earnest in 2007. Jdb-2.0 has the following goals:

in-one-process processing: While jdb is great on the Unix command line as a pipeline between programs, it should also be possible to set it up to run in a single process. And if it does so, it should be able to avoid serializing and deserializing (converting to and from text) data between each module. (Accomplished in jdb-2.0: see dbpipeline, although still needs tuning.)
clean IO API: Jdb's roots go back to perl4 and 1991, so the jdb-1.x library is very, very crufty. More than just being ugly (but it was that too), this made things reading from one format file and writing to another the application's job, when it should be the library's. (Accomplished in jdb-1.15 and improved in 2.0: see the Jdb::IO manpage.)
normalized module APIs: Because jdb modules were added as needed over 10 years, sometimes the module APIs became inconsistent. (For example, the 1.x dbcolcreate required an empty value following the name of the new column, but other programs specify empty values with the -e argument.) We should smooth over these inconsistencies. (Accomplished as each module was ported in 2.0 through 2.7.)
everyone handles all input formats: Given a clean IO API, the distinction between "colized" and "listized" jdb files should go away. Any program should be able to read and write files in any format. (Accomplished in jdb-2.1.)

Jdb-2.0 preserves backwards compatibility where possible, but breaks it where necessary to accomplish the above goals. As of August 2008, jdb-2.7 is the preferred version.

Contributors

JDB includes code ported from Geoff Kuenning (Jdb::Support::TDistribution).

JDB contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond arkadig@dyna.com, Haobo Yu haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu.

JDB includes datasets contributed from NIST (DATA/nist_zarr13.jdb), from http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm, the NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1. Background and Data. The source is public domain, and reproduced with permission.

RELATED WORK

As stated in the introduction, JDB is an incompatible reimplementation of the ideas found in /rdb. By storing data in simple text files and processing it with pipelines it is easy to experiment (in the shell) and look at the output. The original implementation of this idea was /rdb, a commercial product described in the book UNIX relational database management: application development in the UNIX environment by Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web page http://www.rdb.com/).

In August, 2002 I found out Carlo Strozzi extended RDB with his package NoSQL http://www.linux.it/~carlos/nosql/. According to Mr. Strozzi, he implemented NoSQL in awk to avoid the Perl start-up of RDB. Although I haven't found Perl startup overhead to be a big problem on my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may want to evaluate his system. The Linux Journal has a description of NoSQL at http://www.linuxjournal.com/article/3294. It seems quite similar to JDB. Like /rdb, NoSQL supports indexing (not present in JDB). JDB appears to have richer support for statistics, and, as of JDB-2.x, its support for Perl threading may support faster performance (one-process, less serialization and deserialization).

RELEASE NOTES

Versions prior to 1.0 were released informally on my web page but were not announced.

0.0 1991

started for my own research use

0.1 26-May-94

first check-in to RCS

0.2 15-Mar-95

parts now require perl5

1.0, 22-Jul-97

adds autoconf support and a test script.

1.1, 20-Jan-98

support for double space field separators, better tests

1.2, 11-Feb-98

minor changes and release on comp.lang.perl.announce

1.3, 17-Mar-98

adds median and quartile options to dbstats
adds dmalloc_to_db converter
fixes some warnings
dbjoin now can run on unsorted input
fixes a dbjoin bug
some more tests in the test suite

1.4, 27-Mar-98

improves error messages (all should now report the program that makes the error)
fixed a bug in dbstats output when the mean is zero

1.5, 25-Jun-98

BUG FIX dbcolhisto, dbcolpercentile now handles non-numeric values like dbstats
NEW dbcolstats computes zscores and tscores over a column
NEW dbcolscorrelate computes correlation coefficients between two columns
INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
BUG FIX all tests are now ``portable'' (previously some tests ran only on my system)
BUG FIX you no longer need to have the db programs in your path (fix arose from a discussion with Arkadi Gelfond)
BUG FIX installation no longer uses cp -f (to work on SunOS 4)

1.6, 24-May-99

NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp files if necessary)
NEW dbcolmovingstats does moving means over a series of data
NEW dbcol has a -v option to get all columns except those listed
NEW dbmultistats does quartitles and medians
NEW dbstripextraheaders now also cleans up bogus comments before the fist header
BUG FIX dbcolneaten works better with double-space-separated data

1.7, 5-Jan-00

NEW dbcolize now detects and rejects lines that contain embedded copies of the field separator
NEW configure tries harder to prevent people from improperly configuring/installing jdb
NEW tcpdump_to_db converter (incomplete)
NEW tabdelim_to_db converter: from spreadsheet tab-delimited files to db
NEW mailing lists for jdb are jdb-announce@heidemann.la.ca.us and jdb-talk@heidemann.la.ca.us: To subscribe to either, send mail to jdb-announce-request@heidemann.la.ca.us or jdb-talk-request@heidemann.la.ca.us with "subscribe" in the BODY of the message.
BUG FIX dbjoin used to produce incorrect output if there were extra, unmatched values in the 2nd table. Thanks to Graham Phillips for providing a test case.
BUG FIX the sample commands in the usage strings now all should explicitly include the source of data (typically from "cat foo.jdb |"). Thanks to Ya Xu for pointing out this documentation deficiency.
BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output.

1.8, 28-Jun-00

BUG FIX header options are now preserved when writing with dblistize
NEW dbrowuniq now optionally checks for uniqueness only on certain fields
NEW dbrowsplituniq makes one pass through a file and splits it into separate files based on the given fields
NEW converter for "crl" format network traces
NEW anywhere you use arbitrary code (like dbroweval), _last_foo now maps to the last row's value for field _foo.
OPTIMIZATION comment processing slightly changed so that dbmultistats now is much faster on files with lots of comments (for example, ~100k lines of comments and 700 lines of data!) (Thanks to Graham Phillips for pointing out this performance problem.)
BUG FIX dbstats with median/quartiles now correctly handles singleton data points.

1.9, 6-Nov-00

NEW dbfilesplit, split a single input file into multiple output files (based on code contributed by Pavlin Radoslavov).
BUG FIX dbsort now works with perl-5.6

1.10, 10-Apr-01

BUG FIX dbstats now handles the case where there are more n-tiles than data
NEW dbstats now includes a -S option to optimize work on pre-sorted data (inspired by code contributed by Haobo Yu)
BUG FIX dbsort now has a better estimate of memory usage when run on data with very short records (problem detected by Haobo Yu)
BUG FIX cleanup of temporary files is slightly better

1.11, 2-Nov-01

BUG FIX dbcolneaten now runs in constant memory
NEW dbcolneaten now supports "field specifiers" that allow some control over how wide columns should be
OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly (inspired by "Information and Control in Gray-box Systems" by the Arpaci-Dusseau's at SOSP 2001)
INTERNAL t_distr now ported to perl5 module DbTDistr

1.12, 30-Oct-02

BUG FIX dbmultistats documentation typo fixed
NEW dbcolmultiscale
NEW dbcol has -r option for "relaxed error checking"
NEW dbcolneaten has new -e option to strip end-of-line spaces
NEW dbrow finally has a -v option to negate the test
BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check Scheaffer test cases)
BUG FIX some patches to run with Perl 5.8. Note: some programs (dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like: "Use of uninitialized value in concatenation (.)" or "string at /usr/lib/perl5/5.8.0/FileCache.pm line 98, <STDIN> line 2". Please ignore this until I figure out how to suppress it. (Thanks to Jerry Zhao for noticing perl-5.8 problems.)
BUG FIX fixed an autoconf problem where configure would fail to find a reasonable prefix (thanks to Fabio Silva for reporting the problem)
NEW db_to_html_table: simple conversion to html tables (NO fancy stuff)
NEW dblib now has a function dblib_text2html() that will do simple conversion of iso-8859-1 to HTML

1.13, 4-Feb-04

NEW jdb added to the freebsd ports tree http://www.freshports.org/databases/jdb/. Maintainer: larse@isi.edu
BUG FIX properly handle trailing spaces when data must be numeric (ex. dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu nxu@aludra.usc.edu.
NEW dbcolize error message improved (bug report from Terrence Brannon), and list format documented in the README.
NEW cgi_to_db converts CGI.pm-format storage to jdb list format
BUG FIX handle numeric synonyms for column names in dbcol properly
ENHANCEMENT "talking about columns" section added to README. Lack of documentation pointed out by Lars Eggert.
CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send mail, rather than sendmail (sendmail is still an option, but mail doesn't require running as root)
NEW on platforms that support it (i.e., with perl 5.8), jdb works fine with unicode
NEW dbfilevalidate: check a db file for some common errors

1.14, 24-Aug-06

ENHANCEMENT README cleanup
INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols
NEW dbcolsplittorows split one column into multiple rows
NEW dbcolsregression compute linear regression and correlation for two columns
ENHANCEMENT cvs_to_db: better error handling, normalize field names, skip blank lines
ENHANCEMENT dbjoin now detects (and fails) if non-joined files have duplicate names
BUG FIX minor bug fixed in calculation of Student t-distributions (doesn't change any test output, but may have caused small errors)

1.15, 12-Nov-07

NEW jdb-1.14 added to the MacOS Fink system http://pdb.finkproject.org/pdb/package.php/jdb. (Thanks to Lars Eggert for maintaining this port.)
NEW Jdb::IO::Reader and Jdb::IO::Writer now provide reasonably clean OO I/O interfaces to Jdb files. Highly recommended if you use jdb directly from perl. In the fullness of time I expect to reimplement the entire thing using these APIs to replace the current dblib.pl which is still hobbled by its roots in perl4.
NEW dbmapreduce now implements a Google-style map/reduce abstraction, generalizing dbmultistats.
ENHANCEMENT jdb now uses the Perl build system (Makefile.PL, etc.), instead of autoconf. This change paves the way to better perl-5-style modularization, proper manual pages, input of both listize and colize format for every program, and world peace.
ENHANCEMENT dblib.pl is now moved to Jdb::Old.pm.
BUG FIX dbmultistats now propagates its format argument (-f). Bug and fix from Martin Lukac (thanks!).
ENHANCEMENT dbformmail documentation now is clearer that it doesn't send the mail, you have to run the shell script it writes. (Problem observed by Unkyu Park.)
ENHANCEMENT adapted to autoconf-2.61 (and then these changes were discarded in favor of The Perl Way.
BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1))
ENHANCEMENT dbmultistats can now optionally run with pre-grouped input in O(1) memory
ENHANCEMENT dbroweval -N was finally implemented (eat comments)

2.0, 25-Jan-08

2.0, 25-Jan-08 --- a quiet 2.0 release (gearing up towards complete)

ENHANCEMENT: shifting old programs to Perl modules, with the front-end program as just a wrapper. In the short-term, this change just means programs have real man pages. In the long-run, it will mean that one can run a pipeline in a single Perl program. So far: dbcol, dbroweval, the new dbrowcount. dbsort the new dbmerge, the old dbstats (renamed dbcolstats), dbcolrename, dbcolcreate,

NEW: the Jdb::Filter::dbpipeline manpage is an internal-only module that lets one use jdb commands from within perl (via threads).

It also provides perl function aliases for the internal modules, so a string of jdb commands in perl are nearly as terse as in the shell:

    use Jdb::Filter::dbpipeline qw(:all);
    dbpipeline(
        dbrow(qw(name test1)),
        dbroweval('_test1 += 5;')
    );

INCOMPATIBLE CHANGE: The old dbcolstats has been renamed dbcolstatscores. The new dbcolstats does the same thing as the old dbstats. This incompatibility is unfortunate but normalizes program names.

CHANGE: The new dbcolstats program always outputs - (the default empty value) for statistics it cannot compute (for example, standard deviation if there is only one row), instead of the old mix of - and "na".

INCOMPATIBLE CHANGE: The old dbcolstats program, now called dbcolstatscores, also has different arguments. The -t mean,stddev option is now --tmean mean --tstddev stddev. See dbcolstatscores for details.

INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the default value rather than requiring each column to have an initial constant value. To change the initial value, sue the new -e option.

NEW: dbrowcount counts rows, an almost-subset of dbcolstats's n output (except without differentiating numeric/non-numeric input), or the equivalent of dbstripcomments | wc -l.

NEW: dbmerge merges two sorted files. This functionality was previously embedded in dbsort.

INCOMPATIBLE CHANGE: dbjoin's -i option to include non-matches is now renamed -a, so as to not conflict with the new standard option -i for input file.

2.1, 6-Apr-08

2.1, 6-Apr-08 --- another alpha 2.0, but now all converted programs understand both listize and colize format

ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1: dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize
ENHANCEMENT dbmerge now handles an arbitrary number of input files, not just exactly two.
NEW dbmerge2 is an internal routine that handles merging exactly two files.
INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather than assuming the first two arguments were tables (as in jdb-1).: The old dbjoin argument -i is now -a or <--type=outer>.; A minor change: comments in the source files for dbjoin are now intermixed with output rather than being delayed until the end.
ENHANCEMENT dbsort now no longer produces warnings when null values are passed to numeric comparisons.
BUG FIX dbroweval now once again works with code that lacks a trailing semicolon. (This bug fixes a regression from 1.15.)
INCOMPATIBLE CHANGE dbcolneaten's old -e option (to avoid end-of-line spaces) is now -E to avoid conflicts with the standard empty field argument.
INCOMPATIBLE CHANGE dbcolhisto's old -e option is now -E to avoid conflicts. And its -n, -s, and -w are now -N, -S, and -W to correspond.
NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with different options.
ENHANCEMENT The library routines Jdb::IO now understand both list-format and column-format data, so all converted programs can now automatically read either format. This capability was one of the milestone goals for 2.0, so yea!

2.2, 23-May-08

Release 2.2 is another 2.x alpha release. Now most of the commands are ported, but a few remain, and I plan one last incompatible change (to the file header) before 2.x final.

ENHANCEMENT: shifting more old programs to Perl modules. New in 2.2: dbrowaccumulate, dbformmail. dbcolmovingstats. dbrowuniq. dbrowdiff. dbcolmerge. dbcolsplittocols. dbcolsplittorows. dbmapreduce. dbmultistats. dbrvstatdiff. Also dbrowenumerate exists only as a front-end (command-line) program.
INCOMPATIBLE CHANGE: The following programs have been dropped from jdb-2.x: dbcoltighten, dbfilesplit, dbstripextraheaders, dbstripleadingspace.
NEW: combined_log_format_to_db to convert Apache logfiles
INCOMPATIBLE CHANGE: Options to dbrowdiff are now -B and -I, not -a and -i.
INCOMPATIBLE CHANGE: dbstripcomments is now dbfilestripcomments.
BUG FIXES: dbcolneaten better handles empty columns; dbcolhisto warning suppressed (actually a bug in high-bucket handling).
INCOMPATIBLE CHANGE: dbmultistats now requires a -k option in front of the key (tag) field, or if none is given, it will group by the first field (both like dbmapreduce).
KNOWN BUG: dbmultistats with quantile option doesn't work currently.
INCOMPATIBLE CHANGE: dbcoldiff is renamed dbrvstatdiff.
BUG FIXES: dbformmail was leaving its log message as a command, not a comment. Oops. No longer.

2.3, 27-May-08 (alpha)

Another alpha release, this one just to fix the critical dbjoin bug listed below (that happens to have blocked my MP3 jukebox :-).

BUG FIX: Dbsort no longer hangs if given an input file with no rows.
BUG FIX: Dbjoin now works with unsorted input coming from a pipeline (like stdin). Perl-5.8.8 has a bug (?) that was making this case fail---opening stdin in one thread, reading some, then reading more in a different thread caused an lseek which works on files, but fails on pipes like stdin. Go figure.
BUG FIX / KNOWN BUG: The dbjoin fix also fixed dbmultistats -q (it now gives the right answer). Although a new bug appeared, messages like: Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl interpreter: 0xa8350b8 during global destruction. So the dbmultistats_quartile test is still disabled.

2.4, 18-Jun-08

Another alpha release, mostly to fix minor usability problems in dbmapreduce and client functions.

ENHANCEMENT: dbrow now defaults to running user supplied code without warnings (as with jdb-1.x). Use --warnings or -w to turn them back on.
ENHANCEMENT: dbroweval can now write different format output than the input, using the -m option.
KNOWN BUG: dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string table refcount" and "Scalars leaked" when run with an external program as a reducer.; dbmultistats emits the warning "Attempt to free unreferenced scalar" when run with quartiles.; In each case the output is correct. I believe these can be ignored.
CHANGE: dbmapreduce no longer logs a line for each reducer that is invoked.

2.5, 24-Jun-08

Another alpha release, fixing more minor bugs in dbmapreduce and lossage in Jdb::IO.

ENHANCEMENT: dbmapreduce can now tolerate non-map-aware reducers that pass back the key column in put. It also passes the current key as the last argument to external reducers.
BUG FIX: the Jdb::IO::Reader manpage, correctly handle -header option again. (Broken since jdb-2.3.)

2.6, 11-Jul-08

Another alpha release, needed to fix DaGronk. One new port, small bug fixes, and important fix to dbmapreduce.

ENHANCEMENT: shifting more old programs to Perl modules. New in 2.2: dbcolpercentile.
INCOMPATIBLE CHANGE and ENHANCEMENTS dbcolpercentile arguments changed, use --rank to require ranking instead of -r. Also, --ascending and --descending can now be specified separately, both for --percentile and --rank.
BUG FIX: Sigh, the sense of the --warnings option in dbrow was inverted. No longer.
BUG FIX: I found and fixed the string leaks (errors like "Unbalanced string table refcount" and "Scalars leaked") in dbmapreduce and dbmultistats. (All IO::Handles in threads must be manually destroyed.)
BUG FIX: The -C option to specify the column separator in dbcolsplittorows now works again (broken since it was ported).

2.7, 30-Jul-08 beta

The beta release of jdb-2.x. Finally, all programs are ported. As statistics, the number of lines of non-library code doubled from 7.5k to 15.5k. The libraries are much more complete, going from 866 to 5164 lines. The overall number of programs is about the same, although 19 were dropped and 11 were added. The number of test cases has grown from 116 to 175. All programs are now in perl-5, no more shell scripts or perl-4. All programs now have manual pages.

Although this is a major step forward, I still expect to rename "jdb" to "fsdb".

ENHANCEMENT: shifting more old programs to Perl modules. New in 2.7: dbcolscorellate. dbcolsregression. cgi_to_db. dbfilevalidate. db_to_csv. csv_to_db, db_to_html_table, kitrace_to_db, tcpdump_to_db, tabdelim_to_db, ns_to_db.
INCOMPATIBLE CHANGE: The following programs have been dropped from jdb-2.x: db2dcliff, dbcolmultiscale, crl_to_db. ipchain_logs_to_db. They may come back, but seemed overly specialized. The following program dbrowsplituniq was dropped because it is superseded by dbmapreduce. dmalloc_to_db was dropped pending a test cases and examples.
ENHANCEMENT: dbfilevalidate now has a -c option to correct errors.
NEW: html_table_to_db provides the inverse of db_to_html_table.

2.8, 5-Aug-08

Change header format, preserving forwards compatibility.

BUG FIX: Complete editing pass over the manual, making sure it aligns with jdb-2.x.
SEMI-COMPATIBLE CHANGE: The header of jdb files has changed, it is now #fsdb, not #h (or #L) and parsing of -F and -R are also different. See dbfilealter for the new specification. The v1 file format will be read, compatibly, but not written.
BUG FIX: dbmapreduce now tolerates comments that preceed the first key, instead of failing with an error message.

2.9, 6-Aug-08

Still in beta; just a quick bug-fix for dbmapreduce.

ENHANCEMENT: dbmapreduce now generates plausible output when given no rows of input.

AUTHOR

John Heidemann, johnh@isi.edu

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

A copy of the GNU General Public License can be found in the file ``COPYING''.

2.10, 23-Sep-08

Still in beta, but picking up some bug fixes.

ENHANCEMENT: dbmapreduce now generates plausible output when given no rows of input.
ENHANCEMENT: dbroweval the warnings option was backwards; now corrected. As a result, warnings in user code now default off (like in jdb-1.x).
BUG FIX: dbcolpercentile now defaults to assuming the target column is numeric. The new option -N allows selectin of a non-numeric target.
BUG FIX: dbcolscorrelate now includes --sample and --nosample options to compute the sample or full population correlation coefficients. Thanks to Xue Cai for finding this bug.

COMMENTS and BUG REPORTS

Any comments about these programs should be sent to John Heidemann johnh@isi.edu.