Sunday, January 25, 2009

The Data Dictionary and Calculations, Part 2


The Simple Case Is Not Much Help

We will begin by examining a simple case of a shopping cart. We have the columns PRICE and QUANTITY, and we want to add a column EXTENDED_PRICE that will contain PRICE * QUANTITY. Our dictionary might look something like this:

table orderlines:
    ...details...
    
    column price:
        # see last week's essay for details on FETCH
        automation_id: FETCH
        auto_formula: items.price
    
    # A user-entered value, no automation
    column quantity:
    
    # The extended price:
    column extended_price:
        automation_id: extend
        auto_formula: price * quantity

This seems simple enough: we have specified the formula right in the table definition, and now we are free to make use of that formula in any way we want -- either by generating code or by interpreting it at run-time.

Unfortunately, this example is a bad place to start coding. The problem is that it is too simple, and it will lead us down paths that cause great problems when we hit more complex cases. We must begin with a more complex case before we consider how to use this formula in our framework.

Not All Items Are Taxable

Consider the case where you have a shopping cart online and you must collect sales tax, but not all items are taxable. This means you need conditional logic of some sort: you must look at a flag and then decide whether to add tax. Here is a first stab at what this might look like:

table orderlines:
    ...details....
    
    # We'll skip the details on these columns for now
    column price:    
    column quantity:
    column flag_taxable:
    column taxrate:
    
    # We need to know the extended amount, that is
    # what we will tax
    column extended_amount:
        automation_id: extend
        auto_formula: price * quantity
    
    # Here is the column that matters
    column tax:
        automation_id: extend
        auto_formula: CASE WHEN flag_taxable = 'Y'
                           THEN taxrate * extended_amount
                           ELSE 0 END

While this looks like a simple enough extension of the first example, it leads us into a thorny decision: the choice between parsing and assembling.

Parse Versus Assemble

Before I get into the parse-versus-assemble question, let me pull back and explain why the example bothers me, and why it is worth an entire essay to discuss. In short, we intend to use the dictionary to implement a radical form of DRY -- Don't Repeat Yourself (see the Wikipedia article on DRY). Once we have specified the formula in the dictionary, we want to use it for all code generation and docs generation at the very least, but we may also want to refer to the formulas in Java code (or PHP, Python, Ruby etc.) or even in Javascript code on the browser.

But the problem with the example is that it is coded in SQL. In the form I presented, it can be used for generating triggers, but not for anything else, unless you intend to use a parser to split it all up into pieces that can be reassembled for different presentations. The example as written is useful only for a single purpose -- but everything in our dictionary ought to be useful at any layer in the framework for any purpose.

But it gets worse. What if the programmer uses a dialect of SQL aimed at one platform that does not work on another? To guarantee cross-server compatibility, we would not only have to parse the expression, we would then have to re-assemble it for each target.

There is a third argument against the use of SQL expressions. We may be able to parse the expression and satisfy ourselves that it is valid, but that still does not mean it will work -- it may refer to non-existent columns or require typecasts that the programmer did not provide. This leads to one terrible event that a dictionary ought to let you prevent: an upgrade that runs successfully, only to hit a run-time error when somebody uses the system.

A much simpler method is to assemble expressions by having the programmer provide formulas that are already cut up into pieces.

The Assembly Route

So I want to have formulas, including conditionals, and I want to be able to use those formulas in PHP, Java, Javascript, and inside of triggers. I want to be able to generate docs from them that do not contain code fragments, and I want to be able to guarantee that when an upgrade has run there will be no errors introduced through programmer mistakes in the dictionary. The way to do all of this is to specify the formulas a little differently:

    column tax:
        calculate:
            case 00:
                compare: @flag_taxable = Y
                return: @taxrate * @extended_amount
            case 01:
                return: 0

Here are the changes I have made for this version:

  1. The programmer must specify each case in order.
  2. Each case is a compare statement followed by a return.
  3. A case without a compare is unconditional: it always returns, and processing ends.
  4. I stuck little '@' signs in front of column names; I will explain those in a moment.

In short, we want the programmer to provide us with the conditional statements already parsed out into little pieces, so that when we load them they look like data instead of code. The '@' signs mark the column references: when we assemble a formula for a particular target, we know exactly which tokens must be validated against the dictionary and translated into that target's syntax (such as new.price inside a trigger). We now have the responsibility for assembling code fragments, but in exchange we have pre-parsed data that can be handed to any programming language and used.
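
To make this concrete, here is a minimal sketch, in PHP, of how a framework might assemble the pre-parsed cases into a SQL CASE expression for a trigger. Everything here is illustrative: the array layout, the function names, and the 'new.' prefix convention are my own assumptions, not a fixed framework API.

// A minimal assembly sketch. $calc mirrors the 'calculate'
// spec above after it has been parsed into arrays.
$calc = array(
    array('compare' => "@flag_taxable = 'Y'",
          'return'  => '@taxrate * @extended_amount'),
    array('return'  => '0'),
);

// Swap each @column token for the target layer's syntax,
// e.g. 'new.' for a trigger row.
function resolve($expr, $prefix) {
    return preg_replace('/@(\w+)/', $prefix . '$1', $expr);
}

// Assemble the cases into a SQL CASE expression
function assembleSql($calc) {
    $sql = 'CASE';
    foreach ($calc as $case) {
        if (isset($case['compare'])) {
            $sql .= ' WHEN ' . resolve($case['compare'], 'new.')
                  . ' THEN ' . resolve($case['return'],  'new.');
        } else {
            $sql .= ' ELSE ' . resolve($case['return'], 'new.');
        }
    }
    return $sql . ' END';
}

echo assembleSql($calc);
// CASE WHEN new.flag_taxable = 'Y'
//      THEN new.taxrate * new.extended_amount ELSE 0 END

The same $calc data could just as easily be rendered into Javascript or PHP by swapping the resolver prefix and the operator syntax, which is exactly the payoff of storing the formula as data.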

Conclusion: Assembly Means Data

The decision to go the assembly route is simply another example of the Minimize Code, Maximize Data principle. The dictionary itself should be composed entirely of data values; no code snippets should be allowed to sneak in. The reason is simple: no matter what route we follow, we will have to validate and assemble the formula -- be it for PHP, Javascript, or an alternate database server. But if we let the programmer give us code snippets, we have the extra burden of parsing as well. Who needs it?


Sunday, January 18, 2009

The Data Dictionary and Calculations, Part 1

The stunning power of a data dictionary comes into play once the dictionary contains formulas for calculated values. The dictionary can then be used to generate code, and also to generate documentation. This double-win is not available without the calculations because the resulting docs and database would be incomplete, requiring tedious and error-prone manual completion.


Calculations and Normalization

Before I begin, I will point out that all calculated values stored in a database are a form of denormalization: they all introduce redundancies. This does not mean they are bad; it just means you need a way to make sure they stay correct (see Keeping Denormalized Values Correct, and also Normalization and Automation). If you cannot keep them correct, they will get very bad very fast. This essay will show you one approach to ensuring calculated values are always correct.

I also have to point out how important it is to begin by normalizing your database (to at least 3NF) and to add calculations only upon that strong foundation. If you do not normalize first, you will discover that it is impossible to work up formulas that make any sense: values will never be quite where you need them, you will always seem to need support for one more kind of calculation, and it will be very difficult to write a code generator that gives strong results. But if you build on a normalized database, it turns out you only need a few features in your dictionary and your code generator.

Use Denormalization Patterns

Once you have normalized your database, you will find that your calculations all fall into three basic categories (detailed in April 2008 in Denormalization Patterns). These three patterns are:

  • FETCH operations, like copying an item's price from the ITEMS table to the ORDERLINES table.
  • EXTEND operations, which are calculations within a row, such as assigning EXTENDED_PRICE the value of QUANTITY * PRICE.
  • AGGREGATE operations, like a SUM of the lines of an order to the order header.

This week we will look at the first type of operations, the FETCH operations.

Putting the FETCH Into Your Data Dictionary

So we have an ORDERLINES table, and it contains a PRICE column, and the value of that column should be copied from the ITEMS table. This is an extremely common operation in most database applications, so we decide it would be really cool if we could specify that in the data dictionary and have the code generator take care of it. This would chop a lot of labor off the development process.

Here is how a column like this would appear in my own dictionary format:

table orderlines:
    description: Order Lines
    module: sales
    
    column price:
        automation_id: fetch
        auto_formula: items.price
    ...more columns...

This looks nice: I have put the formula for the PRICE column into the data dictionary. Now of course I need that formula to get out into the application somehow, so that it will always be executed and can never be violated. We will now see how to do that.

The Trigger Approach

When it comes to code generators, if there are ten programmers in a room, there are going to be at least ten opinions on how to write and use one (the non-programmer boss will also have an opinion, so that makes eleven). I have no interest in bashing anybody's approach or trying to list all of the possibilities, so I will stick with the approach I use myself, which is to generate database trigger code. If you want to know why that approach works for me, check out Triggers, Encapsulation and Composition.

When I work on code generators, I begin by manually coding an example of what I'm getting at, so I know it works. The trigger snippet we are looking for must do two things. It must make sure the price is always copied, and it must make sure that no user can subvert the value. This snippet (which is in the PostgreSQL flavor of server-side SQL) does this on an insert:

-- top of trigger....

    -- PART 1: Prevent users from subverting the formula
    --         by throwing an error if they try to
    --         supply a value:
    IF new.price IS NOT NULL THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
    
    -- PART 2: If the value of SKU exists, use it to look
    --         up the price and copy it into the new row
    IF new.sku IS NOT NULL THEN 
      SELECT INTO new.price par.price
        FROM items par 
       WHERE new.sku = par.sku ;
    END IF;

-- more trigger stuff

NOTE! You may notice my trigger code somehow seems to "know" to use the SKU column when searching the ITEMS table, yet my formula did not specify that. I am assuming your data dictionary contains definitions of primary keys and foreign keys, otherwise it is of no real use. I am further assuming that when I see the formula to "FETCH" from the ITEMS table, I can look up the foreign key that matches ORDERLINES to ITEMS and find out what column(s) to use.
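
For illustration only, the key definitions might look something like this in the dictionary. The exact foreign key syntax here is hypothetical (every dictionary format will differ), but the point stands: the builder derives the join columns from the keys, not from the formula:

table items:
    column sku:
        primary_key: "Y"
    ...more columns...

table orderlines:
    # Hypothetical syntax: name the parent table, and the
    # builder matches the primary key of ITEMS to find SKU
    foreign_key items:
    ...more columns...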

The example above works on INSERT operations only. You need a slightly different version for updates, which throws an error if the user attempts to change the price, and which does a new FETCH if the user has changed the SKU value.

    IF new.price <> old.price THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
    IF  coalesce(new.sku,'') <> coalesce(old.sku,'')  THEN 
       SELECT INTO new.price par.price
         FROM items par WHERE new.sku = par.sku ;
    END IF;

Sidebar: A Complete Trigger

If you want a teaser on how many amazing things the trigger can do once you've loaded up your dictionary and builder with features, here is a bit of code from a demo application. Almost everything in it will get treated in this series on the data dictionary.

CREATE OR REPLACE FUNCTION orderlines_upd_bef_r_f()
  RETURNS trigger AS
$BODY$
DECLARE
    NotifyList text = '';
    ErrorList text = '';
    ErrorCount int = 0;
    AnyInt int;
    AnyInt2 int;
    AnyRow RECORD;
    AnyChar varchar;
    AnyChar2 varchar;
    AnyChar3 varchar;
    AnyChar4 varchar;
BEGIN
    SET search_path TO public;


    -- 1010 sequence validation
    IF (new.recnum_ol <> old.recnum_ol)  THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ol,3002, may not be re-assigned;';
    END IF;

    -- 1010 sequence validation
    IF (new.skey <> old.skey)  THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'skey,3002, may not be re-assigned;';
    END IF;

    -- 3100 PK Change Validation
    IF new.recnum_ol <> old.recnum_ol THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ol,1003,Cannot change value;';
    END IF;
    -- 3100 END


    IF new.flag_taxable <> old.flag_taxable THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'flag_taxable,5001,may not be explicitly assigned;';
    END IF;

    IF new.price <> old.price THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
   IF  coalesce(new.sku,'') <> coalesce(old.sku,'')  THEN 
       SELECT INTO new.flag_taxable
                   ,new.price
                   par.flag_taxable
                   ,par.price
         FROM items par WHERE new.sku = par.sku ;
   END IF;

    -- 5000 Extended Columns
    IF new.amt_retail <> old.amt_retail THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_retail,5002,Cannot assign value directly to column amt_retail ;';
    ELSE 
        new.amt_retail =  CASE WHEN  1 = 1  THEN new.price*new.qty        ELSE 0 END ;
    END IF;


    IF new.pct99_discount <> old.pct99_discount THEN 
       IF new.pct99_discount <> (SELECT par.pct99_discount FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'pct99_discount,5001,may not be explicitly assigned;';
       END IF;
    END IF;

    IF new.taxauth <> old.taxauth THEN 
       IF new.taxauth <> (SELECT par.taxauth FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'taxauth,5001,may not be explicitly assigned;';
       END IF;
    END IF;

    IF new.taxpct <> old.taxpct THEN 
       IF new.taxpct <> (SELECT par.taxpct FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'taxpct,5001,may not be explicitly assigned;';
       END IF;
    END IF;
   IF  coalesce(new.recnum_ord,0) <> coalesce(old.recnum_ord,0)  THEN 
       SELECT INTO new.pct99_discount
                   ,new.taxauth
                   ,new.taxpct
                   par.pct99_discount
                   ,par.taxauth
                   ,par.taxpct
         FROM orders par WHERE new.recnum_ord = par.recnum_ord ;
   END IF;

    -- 5000 Extended Columns
    IF new.amt_discount <> old.amt_discount THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_discount,5002,Cannot assign value directly to column amt_discount ;';
    ELSE 
        new.amt_discount =  CASE WHEN  1 = 1  THEN new.amt_retail*new.pct99_discount*.01        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_net <> old.amt_net THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_net,5002,Cannot assign value directly to column amt_net ;';
    ELSE 
        new.amt_net =  CASE WHEN  1 = 1  THEN new.amt_retail-new.amt_discount        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_tax <> old.amt_tax THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_tax,5002,Cannot assign value directly to column amt_tax ;';
    ELSE 
        new.amt_tax =  CASE WHEN new.flag_taxable = 'Y' THEN new.amt_net*new.taxpct*.01        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_due <> old.amt_due THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_due,5002,Cannot assign value directly to column amt_due ;';
    ELSE 
        new.amt_due =  CASE WHEN  1 = 1  THEN new.amt_net+new.amt_tax        ELSE 0 END ;
    END IF;

    -- 7010 Column Constraint
    new.flag_taxable = UPPER(new.flag_taxable);
    IF NOT (new.flag_taxable  IN ('Y','N')) THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'new.flag_taxable,6001,Column -Taxable- can be either Y or N;';
    END IF;

    -- 8001 Insert/Update Child Validation: NOT NULL
    IF new.sku IS NULL THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'sku,1005,Required Value;';
    END IF;
    -- 8001 FK Insert/Update Child Validation
    IF new.sku IS NULL THEN
        --Error was reported above, not reported again
        --ErrorCount = ErrorCount + 1;
        --ErrorList = ErrorList || '*,1005,Foreign key columns may not be null: sku;';
    ELSE
        -- LOCK TABLE items IN EXCLUSIVE MODE;
        SELECT INTO AnyInt COUNT(*) FROM items par 
            WHERE par.sku = new.sku;
        IF AnyInt= 0 THEN

            ErrorCount = ErrorCount + 1;
            ErrorList = ErrorList || 'sku,1006,Please Select Valid Value: ' || new.sku::varchar || ';';
        END IF;
    END IF;

    -- 8001 Insert/Update Child Validation: NOT NULL
    IF new.recnum_ord IS NULL THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ord,1005,Required Value;';
    END IF;
    -- 8001 FK Insert/Update Child Validation
    IF new.recnum_ord IS NULL THEN
        --Error was reported above, not reported again
        --ErrorCount = ErrorCount + 1;
        --ErrorList = ErrorList || '*,1005,Foreign key columns may not be null: recnum_ord;';
    ELSE
        -- LOCK TABLE orders IN EXCLUSIVE MODE;
        SELECT INTO AnyInt COUNT(*) FROM orders par 
            WHERE par.recnum_ord = new.recnum_ord;
        IF AnyInt= 0 THEN

            ErrorCount = ErrorCount + 1;
            ErrorList = ErrorList || 'recnum_ord,1006,Please Select Valid Value: ' || new.recnum_ord::varchar || ';';
        END IF;
    END IF;

    IF ErrorCount > 0 THEN
        RAISE EXCEPTION '%',ErrorList;
        RETURN null;
    ELSE
        IF NotifyList <> '' THEN 
             RAISE NOTICE '%',NotifyList;
        END IF; 
        RETURN new;
    END IF;
END; $BODY$
  LANGUAGE 'plpgsql' VOLATILE SECURITY DEFINER
  COST 100;
ALTER FUNCTION orderlines_upd_bef_r_f() OWNER TO postgresql;

Variations on FETCH

I have found two variations on FETCH that have proven very useful in real world applications.

The first I call DISTRIBUTE. It is dangerous because it can be a real performance killer, and it turns out you very rarely need it. That being said, sometimes you want to copy a value from a parent table down to every row in a child table when the value changes in the parent. The first time I did this was to copy the final score from a GAMES table into a WAGERS table on a fake sports betting site.

The other variation I have found useful is FETCHDEF, my shorthand for "fetch by default." In this variation the user is free to supply a value of their own, but if they do not supply a value then it will be fetched for them.
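
In dictionary terms each variation is just another automation_id. A FETCHDEF column might be specified like this (hypothetical syntax, following the format used above):

    column taxrate:
        # fetch by default: copy from the parent ORDERS row
        # only when the user does not supply a value
        automation_id: FETCHDEF
        auto_formula: orders.taxrate

The generated update trigger would then drop the "may not be explicitly assigned" error and perform the fetch only when the incoming value is null.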

The Code Generator Itself

As for writing the code generator itself, that is of course far more than I can cover in one blog entry, or even ten. Moreover, since anybody who decides to write one will do so in their own language and in their own style, there is little to be gained by showing code examples here.

Conclusion: Expand Your Dictionary!

If you make up a data dictionary that only contains structure information like columns and keys, and you write a builder program to build your database, you can get a big win on upgrades and installs. However, you can take that win much farther by adding calculated values to your database and expanding your builder to write trigger code. This week we have seen what it looks like to implement a FETCH calculation in your dictionary and what the resulting trigger code might look like.


Sunday, January 11, 2009

Upgrading Indexes With a Data Dictionary

When you set out to use a Data Dictionary to control your database upgrades, you must consider not just columns and tables, but also indexes and keys. The process for adding indexes and keys is almost the same as that for columns and tables, but there are a few wrinkles you have to be aware of.

Review of Basic Ideas

In my First Essay On Data Dictionaries, one of the major points was that the dictionary is easiest to use if it is in some type of plaintext format that is kept in source control along with the rest of your application files, and is then processed with a "builder" program.

Last week we saw the basic compare operation used by the builder to build and update tables. You read in your data dictionary, query the information_schema to determine the current structure of the database, and then generate commands to add new tables and add new columns to existing tables.

The Importance of Keys and Indexes

If a builder program is to be useful, it must be complete; if it leaves you with manual tasks after a build, then the entire concept of automation is lost. The builder must be able to build the entire structure of the database at the very least, and this means it must be able to work out keys and indexes.

The Basic Steps

The basic steps of building indexes and keys are these:

  • Load your dictionary to some format you can work with easily. I prefer to load it to tables.
  • Query the database's INFORMATION_SCHEMA to determine which indexes and keys already exist.
  • Execute some type of diff to determine which indexes need to be built.
  • Build the indexes that are not there.
  • If you like, drop the indexes that are not in the spec.

Sidebar: Dropping Indexes

A builder program can add things, and it can also drop things. When it comes to destructive operations, I prefer not to have my builder drop tables or columns, because the consequences of a mistake can be unthinkable.

However, when it comes to indexes, it is much more likely to be ok to drop a stray index. Dropping indexes does not destroy user data. Also, extraneous indexes will slow down inserts and updates, so getting rid of them is usually the Right Thing.

Step 1: Your Specification

Your data dictionary format must have some way of letting you specify an index. It is also a good idea to let you specify an ascending or descending property for each column, and whether the index is to be unique (effectively making it a unique constraint).

Here is an example of a very simple set of indexes:

table example:
    description: My example table
    
    index first_name:
        column first_name:
    index last_name:
        column last_name:
    index socialsec:
        unique: "Y"
        column socialsec:
        
    # ... column definitions follow...

I am currently working on a program that requires frequent access by three columns, where the last two are in descending order but the first is not. An index spec for this might look like:

table shipments:
    description: Incoming Magazines
    
    index history:
        column bipad:
        column year:
            flag_asc: "N"
        column issue:
            flag_asc: "N"

    # ... column definitions follow...

As far as loading this into memory, I covered that in some detail last week and will not dwell on it here. I will simply assume you have code to parse and load the spec to a format that works for you.

Step 2: The Information Schema or Server Tables

When I set out to write my builder, I found that the information_schema was a bit more complicated than I needed. The server I was using, Postgres, had a simpler way to get what I wanted. I also found I would get all kinds of extraneous definitions of indexes on system tables or tables that were not in my spec. The query below was the easiest way to get index definitions that were limited to the tables in my spec on the Postgres platform:

SELECT tablename, indexname, indexdef
  FROM pg_indexes
  JOIN zdd.tables ON pg_indexes.tablename = zdd.tables.table_id
 WHERE schemaname = 'public';

As far as primary keys and foreign keys go, the story is basically the same: your server may provide them conveniently, the way Postgres gives index definitions, or you may have to dig a little deeper to get precisely what you want.

Step 3: The Diff

So now we have a picture of the indexes we need to exist, and the indexes that already exist. It is time to look at how to diff them effectively. This step does not work the same way as it does with columns and tables.

Before we go into how to do the diff, let's review how we did it with tables and columns. We can basically diff tables and columns by name. If our spec lists table CUSTOMERS and it does not appear to exist in the database, we can build the table CUSTOMERS, simple as that. But with indexes the name really does not mean anything; what really matters is which columns are being indexed.

This is why we diff indexes on the column definitions, not on their names. If you want a complete trail, you would begin with this table that describes your own indexes:

SPEC_NAME   | TABLE     | COLUMNS
------------+-----------+-----------------
CUST1       | CUSTOMERS | zipcode:state
HIST1       | HISTORY   | bipad:year:issue
ORDERS1     | ORDERS    | bipad:year:issue

Then you pull the list of indexes from the server, and let's say you get something like this:

DB_NAME   | TABLE     | COLUMNS
----------+-----------+-----------------
CUST1     | CUSTOMERS | zipcode:state
ABCD      | HISTORY   | bipad:year:issue
ORDER1    | ORDERS    | year:issue

When you join these two together, you are matching on TABLE and COLUMNS; we do not care about the index names. A query to join them might look like this:

SELECT spec.spec_name,spec.table,spec.columns
      ,db.db_name,db.columns as db_cols
  FROM spec
  FULL OUTER JOIN db   ON spec.table = db.table
                      AND spec.columns = db.columns

This query would give us the following output:

SPEC_NAME   | TABLE     | COLUMNS          | DB_NAME | DB_COLS 
------------+-----------+------------------+---------+---------------------
CUST1       | CUSTOMERS | zipcode:state    | CUST1   | zipcode:state
HIST1       | HISTORY   | bipad:year:issue | ABCD    | bipad:year:issue
ORDERS1     | ORDERS    | bipad:year:issue |         |
            |           |                  | ORDER1  | year:issue

Now let us examine the results row by row.

  • The first row shows that the index on zipcode+state on the CUSTOMERS table is in both the spec and the database; we take no action on that index.
  • The second row shows that the index on bipad+year+issue is also in both the database and the spec. This particular index has a different name in the database, but we don't care (maybe the programmer changed the name in the spec). We take no action on this index.
  • The third row shows an index on the ORDERS table that is not in the database; we must build that index.
  • The fourth row shows an index in the database that is not in the spec; you can drop that one if you want to.

The Rest of It

From here it is a simple matter to generate some commands to create the indexes we need.

Keys work the same way, with a few obvious differences in how they might be named.

We can add features from here to track if the columns are being indexed in ascending or descending order.
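
As a sketch of the generation step, a builder might turn each spec-only row from the diff into a command like this (PHP; the row layout and flag names are my own assumptions, and ascending/descending handling is left out):

// Sketch: turn one row of the diff output into DDL.
function createIndexSql($row) {
    // "bipad:year:issue" -> "bipad,year,issue"
    $cols   = str_replace(':', ',', $row['columns']);
    $unique = ($row['flag_unique'] == 'Y') ? 'UNIQUE ' : '';
    return 'CREATE ' . $unique . 'INDEX ' . $row['spec_name']
         . ' ON ' . $row['table'] . ' (' . $cols . ')';
}

echo createIndexSql(array(
    'spec_name'   => 'orders1',
    'table'       => 'orders',
    'columns'     => 'bipad:year:issue',
    'flag_unique' => 'N',
));
// CREATE INDEX orders1 ON orders (bipad,year,issue)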

Conclusion: Indexes Go By Definition

When writing a database upgrade "builder" program, the key thing to understand about indexes and keys is that you are looking to identify and build indexes according to their definitions; the names do not matter at all.

Sunday, January 4, 2009

Dictionary Based Database Upgrades

The number one search term that brings people to this blog is "data dictionary." So this week I will begin a series on how to use the data dictionary to improve your own productivity and reduce errors.

Building And Upgrading

This week we are going to see how to use a data dictionary to eliminate upgrade scripts (mostly) and make for more efficient upgrades. The approach described here also works for installing a system from scratch, so an install and an upgrade become the same process.

The major problems with upgrade scripts are these:

  • They are the least-tested code in any system, and are the most likely to break. If a script breaks and anybody but the original programmer is running the upgrade, this leads to aborted upgrades and upset customers.
  • They are horribly inefficient when a customer upgrades after a long time: the same table may be rebuilt many times as script after script adds a column or two.

By contrast, a dictionary-based upgrade can take any customer from any version of your software and in the fewest steps possible bring them completely current, with no possibility of broken scripts.

But first, a quick review of an important idea...

Review: The Text File

Back in June of 2008 this blog featured an overview of the data dictionary. There were many ideas in that essay, but I wish to focus on one in particular: the question of where to put the data dictionary and in what format.

In terms of where to put it, the data dictionary should be just another application file, in source control, and delivered with your code. When the dictionary is in a plaintext file (or files) and treated like other application code, you do not have to invent any new methods for handling it, just use the same methods you use for the rest of your code.

In simple practical terms, it is best if a data dictionary can be easily read and written by both people and computers. This leads to a plaintext file or files in some format such as JSON or YAML. I personally prefer YAML because it is a superset of JSON, so using YAML gives me the option to sneak in JSON syntax later if I choose, but starting with JSON does not let me go the other way.
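
As a small illustration, YAML's "flow" syntax is essentially JSON, so both of these spellings of the same column can live in one YAML file:

# block style
column state:
    type: char
    colprec: 2

# the same column in JSON-ish flow style
column state: {type: char, colprec: 2}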

Requiring easy handling by people tends to rule out XML, which is bletcherous to work with manually (in my humble opinion). Requiring readability by the computer rules out UML unless your UML drawing tool can produce a usable data file (comments always welcome, tell us your favorite tool for doing this!). When considering UML, it is the class diagrams that are most likely to be translatable into a data dictionary.

Finally, encoding dictionary information in program class files technically meets the practical requirements listed above, but it has the disadvantage of trapping data in code, which unnecessarily couples your dictionary to whatever language you are using at the moment. It is much better if the dictionary sits outside of the code as pure data. Not to mention that spreading the dictionary out in a collection of one-class-per-table files makes it much harder to do upgrades in the way I am about to describe.

Review of Steps

When using a dictionary-based approach, you write some type of "builder" program that reads your dictionary file, examines the current structure of the database, and then generates SQL commands to alter and create tables to make them all current.

There are plenty of ways to do this. My own approach is to load the dictionary itself into tables, pull the current state into similar tables, and then do queries to find new and altered tables and columns. If you want to see a full-blown program of this type, check out androBuild.php, the Andromeda implementation of this idea. The routines that apply to today's topic include "RealityGet()", "Differences()", "Analyze()" and "PlanMake()".

Step 1: Load the Dictionary to RAM

To use the approach in this essay, you begin by parsing your plaintext file and loading it to tables. Here is a simple example of what a dictionary file might look like in YAML format:

table states:
    description: US States
    
    columns:
        state:
            type: char
            colprec: 2
            caption: State Code
            primary_key: "Y"
        description:
            type: varchar
            colprec: 25
            caption: State Name

If you are using PHP, you can parse this file using the spyc program, which converts the file into associative arrays. All or nearly all modern languages have a YAML parser, check out the YAML site to find yours.
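
In PHP that load step is a single call. A minimal sketch (the dictionary file name is hypothetical):

require_once 'spyc.php';

// Parse the YAML into nested associative arrays. The top-level
// key for the example above is the literal string 'table states'.
$dict = Spyc::YAMLLoad('dictionary.yaml');

foreach ($dict['table states']['columns'] as $colname => $col) {
    echo $colname . ': ' . $col['type'] . "\n";
}
// state: char
// description: varchar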

Step 2: Load the Dictionary To Tables

The database you are building should have some tables that you can use as a scratch area during the upgrade. You may say, "The builder gives me tables, but I need tables to run the builder, so how can I do this?" The simplest way is to hardcode the creation of these tables, as sketched below. A more mature solution would use a separate dictionary file that just defines the dictionary tables.
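
A sketch of the hardcoded approach, with illustrative table and column names only:

// Bootstrap sketch: create the builder's own scratch tables
// before any dictionary processing happens.
$bootstrap = array(
    "CREATE TABLE IF NOT EXISTS tables_spec  (table_id varchar(64))",
    "CREATE TABLE IF NOT EXISTS tables_now   (table_id varchar(64))",
    "CREATE TABLE IF NOT EXISTS columns_spec (table_id varchar(64),
        column_id varchar(64), type varchar(20), colprec int, colscale int)",
    "CREATE TABLE IF NOT EXISTS columns_now  (table_id varchar(64),
        column_id varchar(64), type varchar(20), colprec int, colscale int)",
);
foreach ($bootstrap as $sql) {
    pg_query($connection, $sql);  // assumes an open Postgres connection
}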

The structure of the tables should match the data file, of course. Here is what the YAML above would look like after being loaded to tables:

TABLE   | DESCRIPTION
--------+--------------------------------
states  | US States  
   

TABLE   |COLUMN      |CAPTION    |TYPE    |PRECISION  
--------+------------+-----------+--------+-----------
states  |state       |State Code |char    |2          
states  |description |State Name |varchar |25

Step 3: Fetch The Current State

All modern databases support the "information_schema" database schema, a schema inside of each database that contains tables that describe the structure of the database. While you can make queries directly against the information_schema tables, I prefer to fetch the information out of them into my own tables so that all column names are consistent with my own. A simple query to do this might look like this:

-- Postgres-specific example of pulling info out of the
-- information_schema table:

insert into TABLES_NOW (table_id)  -- my dictionary table
SELECT table_name as table_id 
  FROM information_schema.tables  
 WHERE table_schema = 'public'
   AND table_type = 'BASE TABLE'

Pulling column information out can be much more complicated, owing to differences in how vendors implement information_schema, and owing to the complex way data is stored in it. Here is my own code to pull the column definitions out of the Postgres information_schema; it also simplifies the definitions dramatically, to make my downstream coding easier:

insert into zdd.tabflat_r
 (table_id,column_id,formshort,colprec,colscale)
 SELECT c.table_name, c.column_name,
        CASE WHEN POSITION('timestamp' IN data_type) > 0 THEN 'timestamp'
             WHEN POSITION('character varying' IN data_type) > 0 THEN 'varchar'
             WHEN POSITION('character' IN data_type) > 0 THEN 'char'
             WHEN POSITION('integer' IN data_type) > 0 THEN 'int'
             ELSE data_type END,
        CASE WHEN POSITION('character' IN data_type) > 0 THEN character_maximum_length
             WHEN POSITION('numeric'   IN data_type) > 0 THEN numeric_precision
             ELSE 0 END,
        CASE WHEN POSITION('numeric'   IN data_type) > 0 THEN numeric_scale
             ELSE 0 END
   FROM information_schema.columns c
   JOIN information_schema.tables t ON t.table_name = c.table_name
  WHERE t.table_schema = 'public'
    AND t.table_type   = 'BASE TABLE';

Step 4: The Magic Diff

Now we can see how the magic happens. Imagine you have 20 tables in your application, and in the past week you have modified 5 of them and added two more. You want to upgrade your demo site, so what is the next step for the builder?

The builder must now do a "diff" between your dictionary and the actual state of the database, looking for:

  • Completely new tables.
  • New columns in existing tables.

Let's say you have two tables: "TABLES_SPEC", which lists the tables in your application, and "TABLES_NOW", which lists the tables in your database. The following query will give you a list of new tables:

SELECT spec.table_id
  FROM TABLES_SPEC spec 
 WHERE NOT EXISTS (
       SELECT table_id from TABLES_NOW now
        WHERE now.table_id = spec.table_id
       )

It is now a simple thing to pull the column definitions for each table and generate some DDL to create the tables.
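
A sketch of that generation step in PHP, assuming the column rows have been pulled into an array (all names are hypothetical):

// Sketch: emit CREATE TABLE for one new table, given its
// rows from COLUMNS_SPEC.
function createTableSql($tableId, $columns) {
    $defs = array();
    foreach ($columns as $col) {
        $def = $col['column_id'] . ' ' . $col['type'];
        if ($col['colprec'] > 0) {
            $def .= '(' . $col['colprec']
                  . ($col['colscale'] > 0 ? ',' . $col['colscale'] : '')
                  . ')';
        }
        $defs[] = $def;
    }
    return 'CREATE TABLE ' . $tableId
         . ' (' . implode(', ', $defs) . ')';
}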

But we also have tables that have new columns. We can pull those out like so:

SELECT * from COLUMNS_SPEC spec
 -- the first where clause gets new columns
 WHERE not exists (
       SELECT table_id FROM COLUMNS_NOW now
        WHERE spec.table_id = now.table_id
          AND spec.column_id= now.column_id
       )
   -- this second subquery makes sure we are
   -- getting only existing tables
   AND EXISTS (
       SELECT table_id from TABLES_NOW now
        WHERE now.table_id = spec.table_id
       )

Now again it is a simple matter to generate DDL commands that add all of the new columns into each table. Some databases will allow multiple columns to be added in one statement, while others will require one ALTER TABLE per new column (really horrible when you have to do that).
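
Here is what that might look like, sketched for the one-column-per-statement case (the row layout is hypothetical):

// Sketch: one ALTER TABLE per new column, for servers that
// cannot add several columns in a single statement.
function addColumnSql($row) {
    $type = $row['type'];
    if ($row['colprec'] > 0) {
        $type .= '(' . $row['colprec'] . ')';
    }
    return 'ALTER TABLE ' . $row['table_id']
         . ' ADD COLUMN ' . $row['column_id'] . ' ' . $type;
}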

Please note this is sample code only, just to give you ideas, and it will not cover every case.

Sidebar: Avoid Destructive Actions

Do not rush into writing code that drops columns or tables that are not in your spec. The results of a misstep can be disastrous (as in lose your job or your customer). My own builder code is now 4 1/2 years old and I have never yet bothered to write a "destructive" upgrade that will clean out unused tables and columns. Maybe someday...

What I Left out: Validation

There was no space in this essay to discuss a very important topic: validating the spec changes. It may be that a programmer has done something nasty like change a column type from character to integer. Most databases will fail when attempting to alter the column, because they do not know how to convert the data. Your builder program can trap these events by validating the upgrade before any changes are made, as sketched below. This will be treated fully in a future essay.
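
A sketch of one such check, which refuses the upgrade when a type change is not on a whitelist of safe conversions (the whitelist here is illustrative only):

// Sketch: block risky column type changes before any DDL runs.
$safeConversions = array('char=>varchar', 'int=>numeric');

function typeChangeIsSafe($nowType, $specType, $safe) {
    if ($nowType == $specType) return true;
    return in_array($nowType . '=>' . $specType, $safe);
}

// character -> integer is not whitelisted, so the builder
// reports the column and aborts before altering anything.
var_dump(typeChangeIsSafe('char', 'int', $safeConversions)); // bool(false)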

More that I Left Out: Indexes, Keys...

There are many many other things you can and really must create during the build, beginning with primary key and foreign keys, not to mention indexes as well. These will be covered in a future essay.

More That I Left Out: When You Still Need Scripts

There are plenty of reasons why you may still need a few upgrade scripts, these will be discussed in a future essay. They all come down to moving data around when table structures change significantly.

Conclusion: One Upgrade To Rule Them All

The approach described this week for upgrading databases has many advantages. It is first and foremost the most efficient way to upgrade customers from any version directly to the latest version. It is also the simplest way to handle both installations and upgrades: they are both the same process. Putting the dictionary file into plaintext gives you complete source control just like any other application file, and overall you have a tight, efficient and error-free upgrade process.