Monday, June 29, 2009

Approaches to "UPSERT"

This week in the Database Programmer we look at something called an "UPSERT", the strange trick where an insert command may magically convert itself into an update if a row already exists with the provided key. This trick is very useful in a variety of cases. This week we will see its basic use, and next week we will see how the same idea can be used to materialize summary tables efficiently.


The idea behind an UPSERT is simple. The client issues an INSERT command. If a row already exists with the given primary key, then instead of throwing a key violation error, it takes the non-key values and updates the row.
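To see that behavior end to end, here is a small demonstration using Python's sqlite3 module. SQLite (3.24 and later) spells the feature INSERT ... ON CONFLICT ... DO UPDATE rather than the MySQL syntax shown below; the table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orderlines (
                    order_id INTEGER,
                    sku      TEXT,
                    qty      INTEGER,
                    PRIMARY KEY (order_id, sku))""")

# First INSERT: no row exists yet, so a plain insert happens
conn.execute("""INSERT INTO orderlines (order_id, sku, qty)
                VALUES (1234, 'ABC', 5)
                ON CONFLICT (order_id, sku)
                DO UPDATE SET qty = excluded.qty""")

# Second INSERT with the same key: quietly becomes an UPDATE
conn.execute("""INSERT INTO orderlines (order_id, sku, qty)
                VALUES (1234, 'ABC', 9)
                ON CONFLICT (order_id, sku)
                DO UPDATE SET qty = excluded.qty""")

rows = conn.execute("SELECT order_id, sku, qty FROM orderlines").fetchall()
print(rows)  # [(1234, 'ABC', 9)] -- still one row, with the updated qty
```

No key violation is thrown; the second statement simply lands as an update on the existing row.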

This is one of those unusual cases where MySQL actually supports something you will not find in the other, more mature databases. So if you are using MySQL, you do not need to do anything special to make an UPSERT. You just add the clause "ON DUPLICATE KEY UPDATE" to the INSERT statement:

insert into table (a,b,c) values (1,2,3)
    on duplicate key update
     b = 2,
     c = 3

The MySQL command gives you the flexibility to specify different operations on UPDATE versus INSERT, but with that flexibility comes the requirement that the UPDATE clause completely restate the operation.

With the MySQL command there are also various considerations for AUTO_INCREMENT columns and multiple unique keys. You can read more at the MySQL page for the INSERT ... ON DUPLICATE KEY UPDATE feature.

A Note About MS SQL Server 2008

MS SQL Server introduced something like UPSERT in SQL Server 2008. It uses the MERGE command, which is a bit hairy; check it out in this nice tutorial.

Coding a Simpler UPSERT

Let us say that we want a simpler UPSERT, where you do not have to mess with SQL Server's MERGE or rewrite the entire command as in MySQL. This can be done with triggers.

To illustrate, consider a shopping cart with a natural key of ORDER_ID and SKU. I want simple application code that does not have to figure out if it needs to do an INSERT or an UPDATE, and can always happily do INSERTs, knowing they will be converted to updates if the line is already there. In other words, I want simple application code that just keeps issuing commands like this:

insert into orderlines (order_id, sku, qty)
       values (1234, 'ABC', 5)
We can accomplish this with a trigger. The trigger must fire before the action, and it must redirect the action to an UPDATE if necessary. Let us look at examples for MySQL, Postgres, and SQL Server.

A MySQL Trigger

Alas, MySQL giveth, and MySQL taketh away. You cannot code your own UPSERT in MySQL because of an extremely severe limitation in MySQL trigger rules: a MySQL trigger may not affect a row other than the one originally affected by the command that fired the trigger.

Note: I may be wrong about this. This limitation has bitten me on several features that I would like to provide for MySQL. I am actually hoping this limitation will not apply for UPSERTs because the new row does not yet exist, but I have not had a chance yet to try.

A Postgres Trigger

The Postgres trigger example is pretty simple; hopefully the logic is self-explanatory. As with all code samples, I did this off the top of my head, so you may need to fix a syntax error or two.

CREATE OR REPLACE FUNCTION orderlines_insert_before_F()
RETURNS TRIGGER AS $$
DECLARE
    result INTEGER;
BEGIN
    -- Find out if there is a row
    result := (select count(*) from orderlines
                where order_id = new.order_id
                  and sku      = new.sku);

    -- On the update branch, perform the update
    -- and then return NULL to prevent the 
    -- original insert from occurring
    IF result = 1 THEN
        UPDATE orderlines 
           SET qty = new.qty
         WHERE order_id = new.order_id
           AND sku      = new.sku;
        RETURN null;
    END IF;
    -- The default branch is to return "NEW" which
    -- causes the original INSERT to go forward
    RETURN new;
END;
$$ LANGUAGE plpgsql;

-- That extremely annoying second command you always
-- need for Postgres triggers.
CREATE TRIGGER orderlines_insert_before_T
   BEFORE INSERT ON orderlines
   FOR EACH ROW
   EXECUTE PROCEDURE orderlines_insert_before_F();

A SQL Server Trigger

SQL Server INSTEAD OF INSERT triggers (the closest thing it has to a BEFORE INSERT trigger) are significantly different from Postgres triggers. First of all, they operate at the statement level, so that you have a set of new rows instead of just one. Secondly, the trigger must itself contain an explicit INSERT command, or the INSERT never happens. All of this means our SQL Server example is quite a bit more verbose.

The basic logic of the SQL Server example is the same as the Postgres, with two additional complications. First, we must use a CURSOR to loop through the incoming rows. Second, we must explicitly code the INSERT operation for the case where it occurs. But if you can see past the cruft we get for all of that, the SQL Server example is doing the same thing:

CREATE TRIGGER upsource_insert_before
ON orderlines
INSTEAD OF INSERT
AS
BEGIN
    DECLARE @new_order_id int;
    DECLARE @new_sku      varchar(15);
    DECLARE @new_qty      int;
    DECLARE @result       int;

    DECLARE trig_ins_orderlines CURSOR FOR 
            SELECT order_id, sku, qty FROM inserted;
    OPEN trig_ins_orderlines;

    FETCH NEXT FROM trig_ins_orderlines
     INTO @new_order_id, @new_sku, @new_qty;

    WHILE @@Fetch_status = 0 
    BEGIN
        -- Find out if there is a row now
        SET @result = (SELECT count(*) FROM orderlines
                        WHERE order_id = @new_order_id
                          AND sku      = @new_sku);
        IF @result = 1 
            -- Since there is already a row, do an
            -- update
            UPDATE orderlines
               SET qty = @new_qty
             WHERE order_id = @new_order_id
               AND sku      = @new_sku;
        ELSE
            -- When there is no row, we insert it
            INSERT INTO orderlines (order_id, sku, qty)
                 VALUES (@new_order_id, @new_sku, @new_qty);

        -- Pull the next row
        FETCH NEXT FROM trig_ins_orderlines
         INTO @new_order_id, @new_sku, @new_qty;

    END  -- Cursor iteration

    CLOSE trig_ins_orderlines;
    DEALLOCATE trig_ins_orderlines;
END


A Vague Uneasy Feeling

While the examples above are definitely cool and nifty, they ought to leave a certain nagging doubt in many programmers' minds. This doubt comes from the fact that an insert is not necessarily an insert anymore, which can lead to confusion. Just imagine the new programmer who has joined the team and is banging his head on his desk because he cannot figure out why his INSERTs are not working!

We can add a refinement to the process by making the UPSERT behavior optional. Here is how we do it.

First, add a column to the ORDERLINES table called _UPSERT that is a char(1). Then modify the trigger so that the UPSERT behavior only occurs if this column holds 'Y'. It is also extremely important to always set this value back to 'N' or NULL in the trigger, otherwise it will appear as 'Y' on subsequent INSERTs and the trigger will not work properly.

So our new modified explicit upsert requires a SQL statement like this:

insert into orderlines (_upsert, order_id, sku, qty)
       values ('Y', 1234, 'ABC', 5)
Our trigger code needs only a very slight modification. Here is the Postgres example; the SQL Server example should be very easy to update as well:

   ...trigger declaration and definition above...
   IF new._upsert = 'Y' THEN
       result := (SELECT count(*) FROM orderlines
                   WHERE order_id = new.order_id
                     AND sku      = new.sku);
       new._upsert := 'N';
   ELSE
       result := 0;
   END IF;
   ...rest of trigger is the same...


The UPSERT feature gives us simplified code and fewer round trips to the server. Without the UPSERT there are times when the application may have to query the server to find out if a row exists, and then issue either an UPDATE or an INSERT. With the UPSERT, one round trip is eliminated, and the check occurs much more efficiently inside of the server itself.
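To make the round-trip point concrete, here is a sketch in Python against SQLite (the table and helper names are invented for the example). The first helper issues a query and then a second command; the second helper does the same work in one statement, with the existence check happening inside the server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orderlines (
                    order_id INTEGER, sku TEXT, qty INTEGER,
                    PRIMARY KEY (order_id, sku))""")

def save_line_two_trips(order_id, sku, qty):
    # Without UPSERT: one trip to check, then a second trip to write
    exists = conn.execute(
        "SELECT count(*) FROM orderlines WHERE order_id=? AND sku=?",
        (order_id, sku)).fetchone()[0]
    if exists:
        conn.execute("UPDATE orderlines SET qty=? WHERE order_id=? AND sku=?",
                     (qty, order_id, sku))
    else:
        conn.execute("INSERT INTO orderlines (order_id, sku, qty) VALUES (?,?,?)",
                     (order_id, sku, qty))

def save_line_one_trip(order_id, sku, qty):
    # With UPSERT: a single command, the server does the check
    conn.execute("""INSERT INTO orderlines (order_id, sku, qty)
                    VALUES (?,?,?)
                    ON CONFLICT (order_id, sku)
                    DO UPDATE SET qty = excluded.qty""",
                 (order_id, sku, qty))

save_line_two_trips(1, 'AA', 2)
save_line_one_trip(1, 'AA', 7)   # same key: quietly becomes an update
print(conn.execute("SELECT qty FROM orderlines").fetchall())  # [(7,)]
```

Either way the row ends up correct; the UPSERT version simply gets there with half the network traffic.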

The downside to UPSERTs is that they can be confusing if some type of explicit control is not put onto them such as the _UPSERT column.

Next week we will see a concept similar to UPSERT used to efficiently create summary tables.

Sunday, April 19, 2009

The Relational Model

If you look at any system that was born on and for the internet, like Ruby on Rails, or the PHP language, you find an immense wealth of resources on the internet itself, in endless product web sites, blogs, and forums. But when you look for the same comprehensive information on products or ideas that matured before the web you find it is not there. Relational databases stand out as a product family that matured before the internet, and so their representation in cyberspace is very different from the newer stuff.

The Math Stuff

You may have heard relational theorists argue that the strength of relational databases comes from their solid mathematical foundations. Perhaps you have wondered, what does that mean? And why is that good?

To understand this, we have to begin with Edsger W. Dijkstra, a pioneer in the area of computer science with many accomplishments to his name. Dijkstra believed that the best way to develop a system or program was to begin with a mathematical description of the system, and then refine that system into a working program. When the program completely implemented the math, you were finished.

There is a really huge advantage to this approach. If you start out with a mathematical theory of some sort, which presumably has well known behaviors, then the working program will have all of those behaviors and, put simply, everybody will know what to expect of it.

This approach also reduces time wasted on creative efforts to work out how the program should behave. All those decisions collapse into the simple drive to make the program mimic the math.

A Particular Bit of Math

It so happens that there is a particular body of math known as Relational Theory, which seemed to E. F. Codd to be a very nice fit for storing business information. In his landmark 1970 paper A Relational Model of Data for Large Shared Data Banks (pdf) he sets out to show how these mathematical things called "relations" have behaviors that would be ideal for storing business models.

If we take the Dijkstra philosophy seriously, which is to build systems based on well-known mathematical theories, and we take Codd's claim that "Relations" match well to business record-keeping needs, the obvious conclusion is that we should build some kind of "Relational" datastore, and so we get the Relational Database systems of today.

So there in a nutshell is why relational theorists are so certain of the virtues of the relational model: its behaviors are well-known, and if you can build something that matches them, you will have a very predictable system.

They are Still Talking About It

If you want to know more about the actual mathematics, check out the comp.databases.theory Usenet group, or check out Wikipedia's articles on Relational Algebra and Relational Calculus.

A Practical Downside

The downside to all of this comes whenever the mathematical model describes behaviors that are contrary to human goals or simply irrelevant to them. Examples are not hard to find.

When the web exploded in popularity, many programmers found that their greatest data storage needs centered on documents like web pages rather than collections of atomic values like a customer's discount code or credit terms. They found that relational databases were just not that good at storing documents, which only stands to reason because they were never intended to. In theory the model could be stretched (if the programmer stretched as well), but the programmers could feel in their bones that the fit was not right, and they began searching for something new.

Another example is that of calculated values. If you have a shopping cart, you probably have some field "TOTAL" somewhere that stores the final amount due for the customer. It so happens that such a thing violates relational theory, and there are some very bright theorists who will refuse all requests for assistance in getting that value to work, because you have violated their theory. This is probably the most shameful behavior that relational theorists exhibit - a complete refusal to consider extending the model to better reflect real world needs.

The Irony: There are No Relational Databases

The irony of it all is that when programmers set out to build relational systems, they ran into quite a few practical downsides and a sort of consensus was reached to break the model and create the SQL-based databases we have today. In a truly relational system a table would have quite a few more rules on it than we have in our SQL/TABLE based systems of today. But these rules must have seemed impractical or too difficult to implement, and they were scratched.

There is at least one product out there that claims to be truly relational, that is Dataphor.

The Weird Optional Part

Probably the grandest irony in the so-called relational database management systems is that any programmer can completely break the relational model by making bad table designs. If your tables are not normalized, you lose many of the benefits of the relational model, and you are completely free to make lots of non-normalized and de-normalized tables.


I have to admit I have always found the strength of relational databases to be their simplicity and power, and not so much their foundations (even if shaky) in mathematical theory. A modern database is very good at storing data in tabular form, and if you know how to design the tables, you've got a great foundation for a solid application. Going further, I've always found relational theorists to be unhelpful in the extreme in the edge cases where overall application needs are not fully met by the underlying mathematical model. The good news is that the products themselves have all of the power we need, so I left the relational theorists to their debates years ago.

Sunday, March 1, 2009

I Am But a Humble Filing Clerk

This week we are returning to the series on Philosophy, and we will nail down the role of data and the database in any application that requires such things.

This is the Database Programmer blog, for anybody who wants practical advice on database use.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Review of The Absolute

In the first essay in this series, The Quest For the Absolute, I offered the opinion that all programmers by nature seek absolutes to simplify and inform the development effort. Taking a page from the ancient Greek philosopher Aristotle, I suggested that the best absolute was the quest for the virtuous program, which is to say a program that served its purpose well.

A program that serves its purpose well is one that meets the needs of the check signer, the end-user, and the programmer. The check signer needs some strategic goal to be met, the end-user must be productive, and the programmer must make a living. If a program achieves all of these, it is an ideal virtuous program, and has satisfied the absolute requirements that are true of all programs.

Considering the Decision Maker

Normally we think of a decision maker as some important person who has the power to choose your product or services, or to give her money to your competitor. She makes her decision based on how well she can judge who will meet her strategic needs.

Although the decision maker will have vastly different needs in different situations, and is usually thinking at a high level, she has at least one need that is universal: the simple need to keep and use records. She needs a filing system. All of her stated goals will assume that you both know this unstated goal is down there at the foundation of the entire proposed system.

We programmers often forget this simple fact because computers have been around long enough that we do not remember that in their original forms it was impossible to miss that computers were just electronic filing systems. Way back when I was a kid the day came when phone bills started arriving with an "IBM Card" slipped into them. You returned the card with your check -- they were moving their files into the electronic age. Then came electronic tickets on airlines -- nothing more than a big filing system. The modern web sites we visit to buy tickets are nothing but an interface to what remains a filing system at its heart.

The Virtuous Programmer

So if we go back to the idea of "virtue" as the Greeks thought of it, which means serving your function well, a virtuous programmer will remember always that he is but a humble filing clerk. This is not his entire purpose, but it is the beginning of all other purposes and the foundation that the higher purposes are built upon.

Not Just Relational

This principle is general to all programming. An email server is a program that must receive and store email for later retrieval. What good is an email server that cannot store anything? What good is a camera without its memory card? What good is a mobile phone without its contacts list? What good is an image editing program if it cannot read and write files?

So all programs exist to process data, and the business application programmer knows that in his context this means we are really making giant sexy record-keeping systems. We are the guys that color-code the filing cabinets.

Does Not Mean Relational Is Required

This idea, that we are filing clerks, does not automatically mean we must pick relational databases for the persistence layer -- the question of what filing system to use is a completely different question.


If we begin with the idea that the ideal program meets the needs of decision maker, end-user, and programmer, and if we consider first the needs of the decision maker, then we begin with the universal strategic need to keep good records. The ideal programmer knows this need is at the bottom of all other needs, and remembers always that we are but humble filing clerks.


Saturday, February 14, 2009

A Comprehensive Database Security Model

This week I am taking a bit of a departure. Normally I write about things I have already done, but this week I want to speculate a bit on a security model I am thinking of coding up. Basically I have been asking myself how to create a security model for database apps that never requires elevated privileges for code, but still allows for hosts sharing multiple applications, full table security including row level and column level security, and structural immunity to SQL injection.

The Functional Requirements

Let's consider a developer who will be hosting multiple database applications on a server, sometimes instances of the same application for different customers. The applications themselves will have different needs, but they all boil down to this:

  • Some applications will allow surfers to join the site and create accounts for themselves, while others will be private sites where an administrator must make user accounts.
  • Some applications will not contain sensitive data, and so the site owner wants to send forgotten passwords in email -- which means the passwords must be stored in plaintext. Other site owners will need heightened security that disallows storing of passwords in plaintext.
  • In both cases, administrators must of course be able to manage accounts themselves.
  • The system should be structurally immune to SQL injection.
  • It must be possible to have users with the same user id ("Sheilia", "John", etc.) on multiple applications who are actually totally different people.
  • The application code must never need to run at an elevated privilege level for any reason -- not even to create accounts on public sites where users can join up and conduct transactions.
  • It must be possible for the site owners or their agents to directly connect to the database at very least for querying and possibly to do database writes without going through our application.
  • Users with accounts on one app must never be able to sign on to another app on the same server.

These requirements represent the most flexible possible combination of demands that I have so far seen in real life. The question is, can they be met while still providing security? The model I'd like to speculate on today says yes.

Informed Paranoia Versus Frightened Ignorance

Even the most naive programmer knows that the internet is not a safe place, but all too often a lot of security advice you find is based on frightened ignorance and takes the form, "never do x, you don't know what might happen." If we are to create a strong security model, we have to do better than this.

Much better is to strive to be like a strong system architect, whose approach is based on informed paranoia. This hypothetical architect knows everybody is out to compromise his system, but he seeks a thorough knowledge of the inner workings of his tools so that he can engineer the vulnerabilities out as much as possible. He is not looking to write rules for the programmer that say "never do this", he is rather looking to make it impossible for the user or programmer to compromise the system.

Two Examples

Let us consider a server hosting two applications, which are called "social" and "finance".

The "social" application is a social networking site with minimal security needs. Most important is that the site owners want members of the general public to sign up, and they want to be able to email forgotten passwords (and we can't talk them out of it) -- so we have to store passwords in plaintext.

The "finance" application is a private site used by employees of a corporation around the world. The general public is absolutely not welcome. To make matters worse however, the corporation's IT department demands to be able to directly connect to the database and write to the database without going through the web app. This means the server will have an open port to the database. Sure it will be protected with SSL and passwords, but we must make sure that only users of "finance" can connect, and only to their own application.

Dispensing With Single Sign-On

There are two ways to handle connections to a database. One model is to give users real database accounts, the other is to use a single account to sign on to the database. Prior to the web coming along, there were proponents of both models in the client/server world, but amongst web developers the single sign-on method is so prevalent that I often wonder if they know there is any other way to do it.

Nevertheless, we must dispense with the single sign-on method at the start, regardless of how many people think that Moses carved it on the third tablet, because it just has too many problems:

  • Single Sign-on is the primary architectural flaw that makes SQL injection possible. As we will see later, using real database accounts makes your site (almost) completely immune to SQL injection.
  • Single Sign-on requires a connection at the maximum privilege level that any system user might have, where the code then decides what it will let a particular user do. This is a complete violation of the requirement that code always run at the lowest possible privilege level.
  • Single Sign-on makes it impossible to meet the requirement that authorized agents be allowed to connect to the database and directly read and write values.

So single sign-on just won't work with the requirements listed. This leads us to creating real accounts on the database server.

Real Accounts and Basic Security

When you use a real database account, your code connects to the database using the username and password provided by the user. Anything he is allowed to do your code will be allowed to do, and anything he is not allowed to do will throw an error if your code tries to do it.

This approach meets quite a few of our requirements nicely. A site owner's IT department can connect with the same accounts they use on the web interface -- they have the same privileges in both cases. Also, there is no need to ever have application code elevate its privilege level during normal operations, since no regular users should ever be doing that. This still leaves the issue of how to create accounts, but we will see that below.

A programmer who thinks of security in terms of what code can run will have a very hard time wrapping his head around using real database accounts for public users. The trick to understanding this approach is to forget about code for a minute and to think about tables. The basic fact of database application security is that all security resolves to table permissions. In other words, our security model is all about who can read or write to what tables, it is not about who can run which program.

If we grant public users real database accounts, and they connect with those accounts, the security must be handled within the database itself, and it comes down to:

  • Defining "groups" as collections of users who share permissions at the table level.
  • Deciding which groups are allowed select, insert, update, and delete privileges on which tables.
  • Granting and revoking those privileges on the server itself when the database is built.
  • At very least row-level security will be required, wherein a user can only see and manipulate certain rows in a table. This is how you keep users from using SQL Injection to mess with each other's order history or member profiles.
  • Column security is also very nice to finish off the picture, but we will not be talking about that today as it does not play into the requirements.

Now we can spend a moment and see why this approach eliminates most SQL Injection vulnerabilities. We will imagine a table of important information called SUPERSECRETS. If somebody could slip in a SQL injection exploit and wipe out this table we'd all go to jail, so we absolutely cannot allow this. Naturally, most users would have no privileges on this table -- even though they are directly connected to the database they cannot even see the table exists, let alone delete from it. So if our hypothetical black hat somehow slips in ";delete from supersecrets" and our code fails to trap for it, nothing happens. They have no privilege on that table. On the other side of things, consider the user who is privileged to delete from that table. If this user slips in a ";delete from supersecrets" he is only going to the trouble of using SQL Injection to do something he is perfectly welcome to do anyway through the user interface. So much for SQL injection.
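SQLite has no GRANT or REVOKE, but its authorizer hook can stand in for table privileges and make the point concrete. This is a sketch, not the real server-side mechanism: the SUPERSECRETS table and the denial rule are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supersecrets (secret TEXT)")
conn.execute("INSERT INTO supersecrets VALUES ('the recipe')")

def no_delete_privilege(action, arg1, arg2, db_name, source):
    # Stand-in for "this user has no DELETE privilege on SUPERSECRETS"
    if action == sqlite3.SQLITE_DELETE and arg1 == "supersecrets":
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK

conn.set_authorizer(no_delete_privilege)

# An injected ";delete from supersecrets" simply fails...
try:
    conn.execute("DELETE FROM supersecrets")
    outcome = "deleted"
except sqlite3.DatabaseError:
    outcome = "not authorized"

print(outcome)  # not authorized
# ...and the data is still there
print(conn.execute("SELECT count(*) FROM supersecrets").fetchone()[0])  # 1
```

The application code did nothing to defend itself; the attempt died at the privilege check, which is exactly the structural immunity described above.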

To repeat a point made above: row-level security is a must. If you grant members of a social site global UPDATE privileges on the PROFILES table, and you fail to prevent a SQL Injection, all hell could break loose. Much better is the ability to limit the user to seeing only his own row in the PROFILE table, so that once again you have created a structural immunity to SQL injection.

Anonymous Access

Many public sites allow users to see all kinds of information when they are not logged on. The most obvious example would be an eCommerce site that needs read access to the ITEMS table, among others. Some type of anonymous access must be allowed by our hypothetical framework.

For our two examples, the "social" site might allow limited viewing of member profiles, while the "finance" application must show absolutely nothing to the general public.

If we want a general solution that fits both cases, we opt for a deny-by-default model and allow each application to optionally have an anonymous account.

First we consider deny-by-default. This means simply that our databases are always built so that no group has any permissions on any tables. The programmer of the "social" site now has to grant certain permissions to the anonymous account, while the programmer of the "finance" application does nothing - he already has a secure system.

But still the "finance" site is not quite so simple. An anonymous user account with no privileges can still log in, and that should make any informed paranoid architect nervous. We should extend the deny-by-default philosophy so the framework will not create an anonymous account unless requested. This way the programmer of the "finance" application still basically does nothing, while the programmer of the "social" site must flip a flag to create the anonymous account.

Virtualizing Users

If we are using real database accounts, there is one small detail that has to be addressed. If the "social" site has a user "johnsmith" and the finance application has a user of the same name, but they are totally different people, we have to let both accounts exist but be totally separate.

The answer here is to alias the accounts. The database server would actually have accounts "finance_johnsmith" and "social_johnsmith". Our login process would simply take the username provided and append the code in front of it when authenticating on the server. 'nuf said on that.
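The aliasing step is trivial, but worth pinning down. A one-function sketch (the function name is invented for the example):

```python
def database_account(application, username):
    """Map a per-application username to the real server-level account
    by prepending the application code."""
    return f"{application}_{username}"

# Two different people, same username, two distinct server accounts:
print(database_account("finance", "johnsmith"))  # finance_johnsmith
print(database_account("social", "johnsmith"))   # social_johnsmith
```

The login routine would call something like this before authenticating against the server, so the bare username never exists as a server account at all.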

Allowing Public Users To Join

The "social" site allows anybody to join up and create an account. This means that somehow the web application must be able to create accounts on the database server. Yet it must do this without allowing the web code to elevate its privileges, and while preventing the disaster that would ensue if a user on the "social" site somehow got himself an account on the "finance" site.

Believe it or not, this is the easy part! Here is how it works for the "social" site:

  • Create a table of users. The primary key is the user_id which prevents duplication.
  • For the social site, there is a column called PASSWORD that stores the password in plaintext.
  • Allow the anonymous account to INSERT into this table! (Remember though that deny-by-default means that so far this account has no other privileges).
  • Put an INSERT trigger on the table that automatically creates an aliased user account, so that "johnsmith" becomes "social_johnsmith". The trigger also sets the password.
  • A DELETE trigger on the table would delete users if the row is deleted.
  • An UPDATE trigger on the table would update the password if the user UPDATES the table.
  • Row level security is an absolute must. Users must be able to SELECT and UPDATE the table, but only their own row. If your database server or framework cannot support row-level security, it's all out the window.
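The insert/delete trigger pipeline above can be simulated in miniature. Real account creation requires server-side DDL (something like CREATE USER), which SQLite does not have, so this sketch uses an invented server_accounts table as a stand-in for the server's account list; only the trigger mechanics are the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id TEXT PRIMARY KEY, password TEXT);
-- Stand-in for the server's real account list
CREATE TABLE server_accounts (account TEXT PRIMARY KEY);

-- INSERT trigger: creating a row creates the aliased account
CREATE TRIGGER users_ins AFTER INSERT ON users
BEGIN
    INSERT INTO server_accounts VALUES ('social_' || new.user_id);
END;

-- DELETE trigger: dropping the row drops the account
CREATE TRIGGER users_del AFTER DELETE ON users
BEGIN
    DELETE FROM server_accounts WHERE account = 'social_' || old.user_id;
END;
""")

conn.execute("INSERT INTO users VALUES ('johnsmith', 'secret')")
after_insert = conn.execute("SELECT account FROM server_accounts").fetchall()
print(after_insert)  # [('social_johnsmith',)]

conn.execute("DELETE FROM users WHERE user_id = 'johnsmith'")
after_delete = conn.execute("SELECT count(*) FROM server_accounts").fetchone()[0]
print(after_delete)  # 0
```

The anonymous account only ever touches the users table; the triggers do the privileged work on its behalf, which is why the web code never needs elevated privileges.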

This gives us a system that almost gets us where we need to be: the general public can create accounts, the web application does not need to elevate its privileges, users can set and change their passwords, and no user can see or set anything for any other user. However, this leaves the issue of password recovery.

In order to recover passwords and email them to members of the "social" site, it is tempting to think that the anonymous account must be able to somehow read the users table, but that is no good because then we have a structural flaw where a successful SQL injection would expose user accounts. However, this also turns out to be easy. There are two options:

  • Write a stored procedure that the anonymous user is free to execute, which does not return a password but actually emails it directly from within the database server. This requires your database server be able to send emails. (Postgres can, and I assume SQL Server can, and I don't really know about MySQL).
  • Create a table for password requests, allow inserts to it but nothing else. A trigger sends the email. In this approach you can track email recovery requests.

For the "finance" application we cannot allow any of this to happen, so again we go to the deny-by-default idea. All of the behaviors above will not happen unless the programmer sets a flag to turn them on when the database is built.

This does leave the detail of how users of the "finance" application will reset their passwords. For details on how a secure app can still allow password resets, see my posting of Sept 7 2008 Secure Password Resets.

One More Detail on Public Users

We still have one more detail to handle for public users. Presumably a user, having joined up, has more privileges than the anonymous account. So the web application must be able to join them into a group without elevating its privileges. The solution here is the same as for creating the account: there will be a table that the anonymous user can make inserts into (but nothing else), and a trigger will join the user to whatever group is named.

Except for one more detail. We cannot let the user join whatever group they want, only the special group for members. This requirement can be met by defining the idea of a "freejoin" group and also a "solo" group. If the anonymous user inserts into a user-group table, and the requested group is flagged as allowing anybody to join, the trigger will allow it, but for any other group the trigger will reject the insert. The "solo" idea is similar, it means that if a user is in the "members" group, and that group is a "solo" group, they may not join any other groups. This further jails in members of the general public.
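In trigger terms, the freejoin and solo checks might look like this sketch (PostgreSQL flavor; all table and column names are hypothetical):

```sql
-- Hypothetical sketch of the INSERT trigger on users_groups.
CREATE OR REPLACE FUNCTION users_groups_ins() RETURNS trigger AS $$
BEGIN
    -- Allow the insert only if the requested group is
    -- flagged as open to the public ("freejoin")
    IF NOT EXISTS (SELECT 1 FROM groups g
                    WHERE g.group_id = new.group_id
                      AND g.flag_freejoin = 'Y') THEN
        RAISE EXCEPTION 'group % is not open to the public', new.group_id;
    END IF;
    -- A member of a "solo" group may not join anything else
    IF EXISTS (SELECT 1 FROM users_groups ug
                 JOIN groups g ON g.group_id = ug.group_id
                WHERE ug.user_id = new.user_id
                  AND g.flag_solo = 'Y') THEN
        RAISE EXCEPTION 'user % is in a solo group', new.user_id;
    END IF;
    RETURN new;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
```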

Almost Done: User Administration

In the last two sections we saw the idea of a table of users and a cross-reference of users to groups. This turns out to solve another issue we will have: letting administrators manage groups. If we define a group called "user_administrators" and give them total power on these tables, and also give them CRUD screens for them, then we have a user administrator system. This works for both the "social" and the "finance" application.

The triggers on these tables have to be slightly different for the two cases, but coding them up accordingly is a small exercise.

Cross-Database Access

Believe it or not, the system outlined above has met all of our requirements except one. So far we have a system that never requires the web server to hold any elevated privileges within the database, allows members of the public to join some sites while barring them from others, is structurally immune to SQL injection, allows different people on different sites to have the same user id, and allows administrators of both sites to directly manage accounts. Moreover, we can handle both plaintext passwords and more serious reset-only situations.

This leaves only one very thorny issue: cross-database access. The database server I use most is PostgreSQL, and for this scenario it has a problem: out of the box, a database account can connect to any database. This does not mean the account has any privileges in that database, but we very seriously do not want it to happen at all. If a member of the "social" site can connect to the "finance" app, we have a potential vulnerability even if he has zero privileges in that database. We would be much happier if he could not connect at all.

Postgres has a solution for this, though I have grown not to like it: you can specify that a user may only connect to a database if he belongs to a group with the same name as the database. This is easy to set up, but it requires changing the default configuration of Postgres. For the sheer challenge of it, I would like to work out how to do this without requiring that change; so far I am still puzzling it out. I would also like to know that the approach works at the very least on MS SQL Server and MySQL.
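For reference, the Postgres configuration change in question is one line in pg_hba.conf: the keyword samerole (samegroup in older releases) in the database column admits a user only to databases whose name matches a role he is a member of. A sketch, with the address range and auth method as placeholders:

```
# TYPE  DATABASE   USER   ADDRESS         METHOD
host    samerole   all    192.168.0.0/24  md5
```

With this line in place, "social_johnsmith" can connect to a database only by being a member of a role named after that database.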


Most of what is in this week's essay is not that radical to any informed database veteran. But to web programmers who were unfortunate enough to grow up in the world of relational-databases-must-die nonsense, it is probably hard or impossible to imagine a system where users connect with real database accounts. The irony is that the approach described here is far more secure than any single sign-on system, but it requires the programmer to shift his thinking away from action-based, code-centric models to what is really going on: table-based privileges. Once that hurdle is past, the rest comes easy.

Sunday, February 1, 2009

This Application Has Unique Business Rule Needs

No it does not. If it did, then your customer/employer would be doing something no other human being has ever done, which is unlikely in the extreme. The application may be unique in its particulars, but it is almost certainly extremely common in its patterns. This week we will see how "unique" needs are in fact nothing but common ordinary development projects.

Beginning With the Conclusion

I have had this conversation with many programmers over the past few years, and it always follows the same pattern. The easy part of the argument is showing the programmer that what he thinks is special or unique is in fact common. The much harder part, because it involves the delicate human ego, is showing the programmer that he has not seen this because he is ignorant. That part is no fun, and I usually skip it; it is rarely worth the trouble.

Path 1: Details

Occasionally I speak to a programmer who thinks he has a unique situation. His claim begins with the mountain of details he must handle, details which appear to be contradictory, subtle, and overall perplexing. He wonders if some new approach is required to handle them.

In answering this claim, we begin with the easy part, showing that the situation is itself not unique. In short, all worthwhile projects involve mountains of detail, so there is nothing special there. When it comes to the subtleties and the maze of exceptions and special cases, these are common in mature businesses that have evolved this complexity in response to business needs over the years. So again there is nothing unique here, the programmer's situation is again common.

At this point we have to ask how the programmer will deal with this perplexing mountain of detail. If he knows what he is doing, he will give the general answer that he is going to break it down as much as possible into independent smaller problems that can be solved on their own. Since this is nothing more than how all programmers solve complex problems, the entire "uniqueness" claim has completely collapsed. His project is utterly common.

The much harder part of the conversation comes if the programmer does not know how to break down the problem. For instance, if the problem is all about a fiendishly complex pricing system with lots of discounts and pricing levels, and the programmer does not know that he needs to begin with the database, and he further does not want to hear that, well, there is not much I can do for him. He will end up working a lot harder than he needs to, and will probably remain convinced he is dealing with something "unique".

But let's go a little deeper into that example of the complicated pricing system. Why do I claim that he must start with the tables, and that he is wasting time if he does not? A complete answer is much more than will fit here, and in fact I hit that theme over and over in these essays, but it comes down to this:

  • He must have an accurate and precise description of the details that govern the pricing scheme. That is what tables are for.
  • In working out the mechanics of the tables, particularly their primary and foreign keys, he will come to his most complete understanding of the mechanisms involved.
  • When the tables completely reflect the details he must work with, the code will just about write itself.
  • Lastly, but probably most importantly, the customer will expect to control the pricing system by adjusting the parameters at all levels. Again, that is what tables are for. The user is in control of the pricing system if he can edit the tables (because of course he cannot edit the code).

Path 2: Combinations

Once upon a time we had simple desktop business applications, games, and then this weird new thing, "the web". Now they are all mixed together, as we play games on the internet that are tied into huge databases. Modern applications often combine technologies that used to be comfortably separate. On any particular project, some of the requirements look like they can be met with an RDBMS, some require management and delivery of media such as MP3 or video, and the programmer is told he must also provide RSS feeds and import data arriving in XML format. Perhaps as well there will be stone tablets and papyrus scrolls.

This programmer may believe he is in a unique situation because of this combination of needs. Because no single toolset out there can meet the entire project, perhaps this is something never before seen? But this does not hold up. Just like the argument about complexity, he must break the problem up correctly, and when he has done so he will have a perfectly ordinary project. Though I might add it will also be a very interesting project and probably a lot of fun.

In The End It Is All About Patterns

I have given two examples above taken from my own experience where programmers have claimed to me that they faced some unique situation. There are many other cases, and they always make perfect sense to the person who thinks he has discovered something new. The biggest flaw in the programmer's thinking is failing to distinguish between particulars and patterns.

My claim in this essay is that the patterns of all problems are the same. Somebody has seen it before, somebody has done it before, the answer is out there. The process of analysis and programming is about slotting your particulars into the patterns that have already been established.

In the broadest sense all programs process data, and particular programs break down into broad patterns of data access and manipulation. Sometimes you have a broad range of users putting in data with very little massaging (think Twitter), sometimes you have one group controlling much of the data while others make use of it (think Amazon), sometimes your data is mostly relational and table-based (think any ecommerce or business app), and sometimes it is mostly media (think YouTube).

Once you have these broad patterns identified, you can then proceed to make use of established practices within each particular area. What is the best way to provide sensitive data on the web and protect it from unauthorized eyes? Somebody has done it before. What is the best way to track large amounts of media? Somebody has done it before. What is the best way to set up a complex pricing system with lots of discounts and pricing levels? Somebody has done it before. In all cases, your particulars may be different, but the patterns will be the same.

Conclusion: Find the Patterns

Whenever I find myself looking at a situation that appears to be new, I try to tell myself that it may be new to me, but it is not likely to be new to the human race. If it does not appear to follow a well-known pattern then I proceed as if I have not yet recognized the pattern and continue to analyze and break it apart until the pattern emerges. So far it always has.

Sunday, January 25, 2009

The Data Dictionary and Calculations, Part 2

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The Simple Case Is Not Much Help

We will begin by examining a simple case of a shopping cart. We have columns QTY and PRICE, and we want to add the column EXTENDED_PRICE that will contain PRICE * QTY. Our dictionary might look something like this:

table orderlines:
    column price:
        # see last week's essay for details on FETCH
        automation_id: fetch
        auto_formula: items.price
    # A user-entered value, no automation
    column qty:
    # The extended price:
    column extended_price:
        automation_id: extend
        auto_formula: price * qty

This seems simple enough: we have specified the formula right in the table definition, and now we are free to make use of it in any way we want -- either by generating code or by interpreting it at run-time.

Unfortunately it is a bad idea to start coding right now with this example. The problem is that it is too simple, and will lead us down paths that cause great problems when we hit more complex cases. We must begin with a more complex case before we consider how to use this formula in our framework.

Not All Items Are Taxable

Consider the case where you have a shopping cart online and you must collect sales tax, but not all items are taxable. This means you need conditional logic of some sort, you must look at a flag and then decide whether to add tax. Here is a first stab at what this might look like:

table orderlines:
    # We'll skip the details on these columns for now
    column price:    
    column qty:
    column flag_taxable:
    column taxrate:
    # We need to know the extended amount, that is
    # what we will tax
    column extended_amount:
        automation_id: extend
        auto_formula: price * qty
    # Here is the column that matters
    column tax:
        automation_id: extend
        auto_formula: CASE WHEN flag_taxable = 'Y'
                           THEN taxrate * extended_amount
                           ELSE 0 END

While this looks like a simple enough extension to the first example, it forces us into a thorny decision: the choice between parsing and assembling.

Parse Versus Assemble

Before I get into the parse-versus-assemble question, let me pull back and explain why the example bothers me, and why it is worth an entire essay. In short, we intend to use the dictionary to implement a radical form of DRY -- Don't Repeat Yourself (see the Wikipedia article on DRY). Once we have specified the formula in the dictionary, we want to use it at the very least for all code generation and docs generation, but we may also want to refer to the formulas in Java code (or PHP, Python, Ruby, etc.) or even in Javascript code on the browser.

But the problem with the example is that it is coded in SQL. In the form I presented, it can be used for generating triggers, but not for anything else, unless you intend to use a parser to split it all up into pieces that can be reassembled for different presentations. The example as written is useful only for a single purpose -- but everything in our dictionary ought to be useful at any layer in the framework for any purpose.

But it gets worse. What if the programmer uses a dialect of SQL that works on one platform but not on another? To guarantee cross-server compatibility, we not only have to parse the phrase, but then re-assemble it.

There is a third argument against the use of SQL expressions. We may be able to parse the expression and satisfy ourselves that it is valid, but that still does not mean it will work -- it may refer to non-existent columns or require typecasts that the programmer did not provide. This leads to one terrible event that you ought to be able to prevent when you use a dictionary: having an upgrade run successfully only to hit a run-time error when somebody uses the system.

A much simpler method is to assemble expressions by having the programmer provide formulas that are already cut up into pieces.

The Assembly Route

So I want to have formulas, including conditionals, and I want to be able to use the formulas in PHP, Java, Javascript, inside of triggers, and I want to be able to generate docs out of them that do not contain code fragments, and I want to be able to guarantee when an upgrade has run that there will be no errors introduced through programmer mistakes in the dictionary. The way to do this is to specify the formulas a little differently:

    column taxable:
            case 00:
                compare: @flag_taxable = Y
                return: @taxrate * @extended_amount
            case 01:
                return: 0

Here are the changes I have made for this version:

  1. The programmer must specify each case in order
  2. Each case is a compare statement followed by a return
  3. A case without a compare is unconditional, it always returns and processing ends
  4. I stuck little '@' signs in front of column names, so that column references can be picked out unambiguously without any parsing.

In short, we want the programmer to provide us with the conditional statements already parsed out into little pieces, so when we load them they look like data instead of code. We now have the responsibility for assembling code fragments, but in exchange we have pre-parsed data that can be handed to any programming language and used.
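As a sketch of what "data instead of code" buys us, here is a minimal Python example (the structure and names are my own invention, not a fixed dictionary format) that takes the pre-parsed cases and both assembles a SQL CASE expression and evaluates the formula directly:

```python
# The dictionary's pre-parsed cases for the tax column.  Because
# they are data, one structure serves two targets: we can assemble
# a SQL CASE expression, or evaluate the formula in Python.
cases = [
    {"compare": ("flag_taxable", "Y"), "return": "taxrate * extended_amount"},
    {"return": "0"},  # unconditional final case
]

def to_sql(cases):
    """Assemble the cases into a SQL CASE expression."""
    parts = ["CASE"]
    for c in cases:
        if "compare" in c:
            col, val = c["compare"]
            parts.append("WHEN %s = '%s' THEN %s" % (col, val, c["return"]))
        else:
            parts.append("ELSE %s" % c["return"])
    parts.append("END")
    return " ".join(parts)

def evaluate(cases, row):
    """Evaluate the same cases against a row given as a dict."""
    for c in cases:
        if "compare" in c:
            col, val = c["compare"]
            if row[col] != val:
                continue
        # the return formulas reference column names in the row
        return eval(c["return"], {}, row)
    return None

print(to_sql(cases))
# → CASE WHEN flag_taxable = 'Y' THEN taxrate * extended_amount ELSE 0 END
print(evaluate(cases, {"flag_taxable": "Y", "taxrate": 0.1,
                       "extended_amount": 100.0}))
```

The same loop, translated into PHP or Javascript, gives you the formula at every layer with no SQL parser anywhere.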

Conclusion: Assembly Means Data

The decision to go the assembly route is simply another example of the Minimize Code, Maximize Data principle. The dictionary itself should be composed entirely of data values; no code snippets should be allowed to sneak in. The reason is simple: no matter which route we follow, we will have to validate and assemble the formula -- be it for PHP, Javascript, or an alternate database server. But if we let the programmer give us code snippets, we also take on the extra burden of parsing. Who needs it?

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The normalization essays on this blog are:

Sunday, January 18, 2009

The Data Dictionary and Calculations, Part 1

The stunning power of a data dictionary comes into play once the dictionary contains formulas for calculated values. The dictionary can then be used to generate code, and also to generate documentation. This double-win is not available without the calculations because the resulting docs and database would be incomplete, requiring tedious and error-prone manual completion.

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Calculations and Normalization

Before I begin, I will point out that all calculated values stored in a database are denormalizations; every one of them introduces redundancy. This does not mean they are bad, it just means you need a way to make sure they stay correct (see Keeping Denormalized Values Correct and also Normalization and Automation). If you cannot keep them correct, they will go very bad very fast. This essay shows one approach to ensuring calculated values are always correct.

However, before I start, I have to point out how important it is to begin by normalizing your database (to at least 3NF) and to add calculations only upon that strong foundation. If you do not normalize first, you will discover that it is impossible to work up formulas that make any sense -- values will never be quite where you need them, it will always seem there is one more kind of calculation to support, and it will be very difficult to write a code generator that gives strong results. But if you build on a normalized database, it turns out you need only a few features in your dictionary and your code generator.

Use Denormalization Patterns

Once you have normalized your database, you will find that your calculations all fall into three basic categories (detailed in April 2008 in Denormalization Patterns). These three patterns are:

  • FETCH operations, like copying an item's price from the ITEMS table to the ORDERLINES table.
  • EXTEND operations, which are calculations within a row, such as assigning EXTENDED_PRICE the value of QUANTITY * PRICE.
  • AGGREGATE operations, like a SUM of the lines of an order to the order header.

This week we will look at the first type of operations, the FETCH operations.

Putting the FETCH Into Your Data Dictionary

So we have an ORDERLINES table, and it contains a PRICE column, and the value of that column should be copied from the ITEMS table. This is an extremely common operation in most database applications, so we decide it would be really cool if we could specify that in the data dictionary and have the code generator take care of it. This would chop a lot of labor off the development process.

Here is how a column like this would appear in my own dictionary format:

table orderlines:
    description: Order Lines
    module: sales
    column price:
        automation_id: fetch
        auto_formula: items.price
    ...more columns...

This looks nice, I have put the formula for the PRICE column into the data dictionary. Now of course I need that formula to get out into the application somehow so that it will always be executed and will never be violated. We will now see how to do that.

The Trigger Approach

When it comes to code generators, if there are ten programmers in a room, there are going to be at least 10 opinions on how to write and use a code generator (the non-programmer boss will also have an opinion, so that makes 11). I have no interest in bashing anybody's approach or trying to list all of the possibilities, so I will stick with the approach I use myself, which is to generate database trigger code. If you want to know why that approach works for me, check out Triggers, Encapsulation and Composition.

When I work on code generators, I begin by manually coding an example of what I'm getting at, so I know it works. The trigger snippet we are looking for must do two things. It must make sure the price is always copied, and it must make sure that no user can subvert the value. This snippet (which is in the PostgreSQL flavor of server-side SQL) does this on an insert:

-- top of trigger....

    -- PART 1: Prevent users from subverting the
    --         formula by throwing an error if they
    --         try to supply a value:
    IF new.price IS NOT NULL THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
    -- PART 2: If the value of SKU exists, use it to look
    --         up the price and copy it into the new row
    IF new.sku IS NOT NULL THEN 
      SELECT INTO new.price par.price
        FROM items par 
       WHERE new.sku = par.sku ;
    END IF;

-- more trigger stuff

NOTE! You may notice my trigger code somehow seems to "know" to use the SKU column when searching the ITEMS table, yet my formula did not specify that. I am assuming your data dictionary contains definitions of primary keys and foreign keys, otherwise it is of no real use. I am further assuming that when I see the formula to "FETCH" from the ITEMS table, I can look up the foreign key that matches ORDERLINES to ITEMS and find out what column(s) to use.

The example above works on INSERT operations only. You need a slightly different version for updates, which throws an error if the user attempts to change the price, and which does a new FETCH if the user has changed the SKU value.

    IF new.price <> old.price THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
    IF  coalesce(new.sku,'') <> coalesce(old.sku,'')  THEN 
       SELECT INTO new.price par.price
         FROM items par WHERE new.sku = par.sku ;
    END IF;

Sidebar: A Complete Trigger

If you want a teaser on how many amazing things the trigger can do once you've loaded up your dictionary and builder with features, here is a bit of code from a demo application. Most everything in it will get treated in this series on the data dictionary.

CREATE OR REPLACE FUNCTION orderlines_upd_bef_r_f()
  RETURNS trigger AS
$BODY$
DECLARE
    NotifyList text = '';
    ErrorList text = '';
    ErrorCount int = 0;
    AnyInt int;
    AnyInt2 int;
    AnyRow RECORD;
    AnyChar varchar;
    AnyChar2 varchar;
    AnyChar3 varchar;
    AnyChar4 varchar;
BEGIN
    SET search_path TO public;

    -- 1010 sequence validation
    IF (new.recnum_ol <> old.recnum_ol)  THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ol,3002, may not be re-assigned;';
    END IF;

    -- 1010 sequence validation
    IF (new.skey <> old.skey)  THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'skey,3002, may not be re-assigned;';
    END IF;

    -- 3100 PK Change Validation
    IF new.recnum_ol <> old.recnum_ol THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ol,1003,Cannot change value;';
    END IF;
    -- 3100 END

    IF new.flag_taxable <> old.flag_taxable THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'flag_taxable,5001,may not be explicitly assigned;';
    END IF;

    IF new.price <> old.price THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
   IF  coalesce(new.sku,'') <> coalesce(old.sku,'')  THEN 
       SELECT INTO new.flag_taxable par.flag_taxable
         FROM items par WHERE new.sku = par.sku ;
   END IF;

    -- 5000 Extended Columns
    IF new.amt_retail <> old.amt_retail THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_retail,5002,Cannot assign value directly to column amt_retail ;';
        new.amt_retail =  CASE WHEN  1 = 1  THEN new.price*new.qty        ELSE 0 END ;
    END IF;

    IF new.pct99_discount <> old.pct99_discount THEN 
       IF new.pct99_discount <> (SELECT par.pct99_discount FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'pct99_discount,5001,may not be explicitly assigned;';
       END IF;
    END IF;

    IF new.taxauth <> old.taxauth THEN 
       IF new.taxauth <> (SELECT par.taxauth FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'taxauth,5001,may not be explicitly assigned;';
       END IF;
    END IF;

    IF new.taxpct <> old.taxpct THEN 
       IF new.taxpct <> (SELECT par.taxpct FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'taxpct,5001,may not be explicitly assigned;';
       END IF;
    END IF;
   IF  coalesce(new.recnum_ord,0) <> coalesce(old.recnum_ord,0)  THEN 
       SELECT INTO new.pct99_discount par.pct99_discount
         FROM orders par WHERE new.recnum_ord = par.recnum_ord ;
   END IF;

    -- 5000 Extended Columns
    IF new.amt_discount <> old.amt_discount THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_discount,5002,Cannot assign value directly to column amt_discount ;';
        new.amt_discount =  CASE WHEN  1 = 1  THEN new.amt_retail*new.pct99_discount*.01        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_net <> old.amt_net THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_net,5002,Cannot assign value directly to column amt_net ;';
        new.amt_net =  CASE WHEN  1 = 1  THEN new.amt_retail-new.amt_discount        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_tax <> old.amt_tax THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_tax,5002,Cannot assign value directly to column amt_tax ;';
        new.amt_tax =  CASE WHEN new.flag_taxable = 'Y' THEN new.amt_net*new.taxpct*.01        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_due <> old.amt_due THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_due,5002,Cannot assign value directly to column amt_due ;';
        new.amt_due =  CASE WHEN  1 = 1  THEN new.amt_net+new.amt_tax        ELSE 0 END ;
    END IF;

    -- 7010 Column Constraint
    new.flag_taxable = UPPER(new.flag_taxable);
    IF NOT (new.flag_taxable  IN ('Y','N')) THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'new.flag_taxable,6001,Column -Taxable- can be either Y or N;';
    END IF;

    -- 8001 Insert/Update Child Validation: NOT NULL
    IF new.sku IS NULL THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'sku,1005,Required Value;';
    END IF;
    -- 8001 FK Insert/Update Child Validation
    IF new.sku IS NULL THEN
        --Error was reported above, not reported again
        NULL;
    ELSE
        SELECT INTO AnyInt COUNT(*) FROM items par 
            WHERE par.sku = new.sku;
        IF AnyInt = 0 THEN
            ErrorCount = ErrorCount + 1;
            ErrorList = ErrorList || 'sku,1006,Please Select Valid Value: ' || new.sku::varchar || ';';
        END IF;
    END IF;

    -- 8001 Insert/Update Child Validation: NOT NULL
    IF new.recnum_ord IS NULL THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ord,1005,Required Value;';
    END IF;
    -- 8001 FK Insert/Update Child Validation
    IF new.recnum_ord IS NULL THEN
        --Error was reported above, not reported again
        NULL;
    ELSE
        SELECT INTO AnyInt COUNT(*) FROM orders par 
            WHERE par.recnum_ord = new.recnum_ord;
        IF AnyInt = 0 THEN
            ErrorCount = ErrorCount + 1;
            ErrorList = ErrorList || 'recnum_ord,1006,Please Select Valid Value: ' || new.recnum_ord::varchar || ';';
        END IF;
    END IF;

    IF ErrorCount > 0 THEN
        RAISE EXCEPTION '%',ErrorList;
        RETURN null;
    ELSE
        IF NotifyList <> '' THEN 
             RAISE NOTICE '%',NotifyList;
        END IF; 
        RETURN new;
    END IF;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;
ALTER FUNCTION orderlines_upd_bef_r_f() OWNER TO postgresql;

Variations on FETCH

I have found two variations on FETCH that have proven very useful in real world applications.

The first I call DISTRIBUTE. It is dangerous because it can be a real performance killer, and it turns out you very rarely need it. That being said, sometimes you want to copy a value from a parent table down to every row in a child table when the value changes in the parent. The first time I did this was to copy the final score from a GAMES table into a WAGERS table on a fake sports betting site.
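A sketch of what a DISTRIBUTE might generate, in the same PostgreSQL trigger style as the snippets above (the GAMES/WAGERS column names are hypothetical); this fragment would live in the UPDATE trigger on the parent table:

```sql
    -- DISTRIBUTE: when the final score changes on GAMES,
    -- push the new value into every child row in WAGERS.
    -- This one statement can touch many rows, hence the
    -- performance warning above.
    IF coalesce(new.final_score,-1) <> coalesce(old.final_score,-1) THEN
        UPDATE wagers SET final_score = new.final_score
         WHERE wagers.game_id = new.game_id;
    END IF;
```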

The other variation I have found useful is FETCHDEF, my shorthand for "fetch by default." In this variation the user is free to supply a value of their own, but if they do not supply a value then it will be fetched for them.
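A FETCHDEF fragment might look like the FETCH code shown earlier, with one extra condition (a sketch, not generated output):

```sql
    -- FETCHDEF: fetch by default.  Unlike plain FETCH, a
    -- user-supplied value is allowed; we only look one up
    -- when none was provided.
    IF new.price IS NULL AND new.sku IS NOT NULL THEN
        SELECT INTO new.price par.price
          FROM items par
         WHERE new.sku = par.sku;
    END IF;
```

Note there is no PART 1 here: since the user is allowed to supply a value, there is nothing to subvert and no error to throw.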

The Code Generator Itself

As for writing the code generator itself, that is of course far more than I can cover in one blog entry, or even ten. Moreover, since anybody who decides to write one will do so in his own language and his own style, there is little to be gained by showing code examples here.

Conclusion: Expand Your Dictionary!

If you make up a data dictionary that only contains structure information like columns and keys, and you write a builder program to build your database, you can get a big win on upgrades and installs. However, you can take that win much farther by adding calculated values to your database and expanding your builder to write trigger code. This week we have seen what it looks like to implement a FETCH calculation in your dictionary and what the resulting trigger code might look like.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The normalization essays on this blog are:

Sunday, January 11, 2009

Upgrading Indexes With a Data Dictionary

When you set out to use a Data Dictionary to control your database upgrades, you must consider not just columns and tables, but also indexes and keys. The process for adding indexes and keys is almost the same as that for columns and tables, but there are a few wrinkles you have to be aware of.

Review of Basic Ideas

In my First Essay On Data Dictionaries, one of the major points was that the dictionary is easiest to use if it is in some type of plaintext format, kept in source control along with the rest of your application files, and then processed with a "builder" program.

Last week we saw the Basic compare operation used by the builder to build and update tables. You read in your data dictionary, query the information_schema to determine the current structure of the database, and then generate commands to add new tables and add new columns to existing tables.

The Importance of Keys and Indexes

If a builder program is to be useful, it must be complete: if it leaves you with manual tasks after a build, then the entire concept of automation is lost. The builder must be able to build the entire structure of the database at the very least, and this means it must be able to work out keys and indexes.

The Basic Steps

The basic steps of building indexes and keys are these:

  • Load your dictionary to some format you can work with easily. I prefer to load it to tables.
  • Query the database's INFORMATION_SCHEMA to determine which indexes and keys already exist.
  • Execute some type of diff to determine which indexes need to be built.
  • Build the indexes that are not there.
  • If you like, drop the indexes that are not in the spec.

Sidebar: Dropping Indexes

A builder program can add things, and it can also drop things. When it comes to destructive operations, I prefer not to have my builder drop tables or columns, because the consequences of a mistake can be unthinkable.

However, when it comes to indexes, it is much more likely to be ok to drop a stray index. Dropping indexes does not destroy user data. Also, extraneous indexes will slow down inserts and updates, so getting rid of them is usually the Right Thing.

Step 1: Your Specification

Your data dictionary format must have some way of letting you specify an index. It is also a good idea to allow you to specify an ascending or descending property for each column, and to specify if the index is to be unique (effectively making it a unique constraint).

Here is an example of a very simple set of indexes:

table example:
    description: My example table
    index first_name:
        column first_name:
    index last_name:
        column last_name:
    index socialsec:
        unique: "Y"
        column socialsec:
    # ... column definitions follow...

I am currently working on a program that requires frequent access by three columns, where the last two are in descending order but the first is not. An index spec for this might look like:

table shipments:
    description: Incoming Magazines
    index history:
        column bipad:
        column year:
            flag_asc: "N"
        column issue:
            flag_asc: "N"

    # ... column definitions follow...

As far as loading this into memory, I covered that in some detail last week and will not dwell on it here. I will simply assume you have code to parse and load the spec to a format that works for you.

Step 2: The Information Schema or Server Tables

When I set out to write my builder, I found that the information_schema was a bit more complicated than I needed. The server I was using, Postgres, had a simpler way to get what I wanted. I also found I would get all kinds of extraneous definitions of indexes on system tables or tables that were not in my spec. The query below was the easiest way to get index definitions that were limited to the tables in my spec on the Postgres platform:

SELECT tablename, indexname, indexdef
  FROM pg_indexes
  JOIN zdd.tables ON pg_indexes.tablename = zdd.tables.table_id
 WHERE schemaname = 'public'

As far as primary keys and foreign keys go, the story is basically the same: your server may provide them in a convenient way, as Postgres does with index definitions, or you may have to dig a little deeper to get precisely what you want.

Step 3: The Diff

So now we have a picture of the indexes we need to exist, and the indexes that already exist. It is time to look at how to diff them effectively. This step does not work the same way as it does with columns and tables.

Before we go into how to do the diff, let's review how we did it with tables and columns. We can basically diff tables and columns by name. If our spec lists table CUSTOMERS and it does not appear to exist in the database, we can build the table CUSTOMERS, simple as that. But with indexes the name really does not mean anything; what really matters is which columns are being indexed.

This is why we diff indexes on the column definitions, not on their names. If you want a complete trail, you would begin with this table that describes your own indexes:

CUST1       | CUSTOMERS | zipcode:state
HIST1       | HISTORY   | bipad:year:issue
ORDERS1     | ORDERS    | bipad:year:issue

Then you pull the list of indexes from the server, and let's say you get something like this:

CUST1     | CUSTOMERS | zipcode:state
ABCD      | HISTORY   | bipad:year:issue
ORDER1    | ORDERS    | year:issue

When you join these two together, you are matching on TABLE and COLUMNS; the index names play no part. A query to join them might look like this:

SELECT spec.spec_name, spec.table, spec.columns
      ,db.db_name, db.columns as db_cols
  FROM spec
  FULL OUTER JOIN db  ON spec.table = db.table
                     AND spec.columns = db.columns

This query would give us the following output:

CUST1       | CUSTOMERS | zipcode:state    | CUST1   | zipcode:state
HIST1       | HISTORY   | bipad:year:issue | ABCD    | bipad:year:issue
ORDERS1     | ORDERS    | bipad:year:issue |         |
            |           |                  | ORDER1  | year:issue

Now let us examine the results row by row.

  • The first row shows that the index on zipcode+state on the customers table is in both the spec and the database; we take no action on that index.
  • The second row shows that the index on bipad+year+issue is also in both the database and the spec. This particular index has a different name in the database, but we don't care (maybe the programmer changed the name in the spec). We take no action on this index.
  • The third line shows an index on the ORDERS table that is not in the database; we must build that index.
  • The fourth line shows an index in the database that is not in the spec; you can drop that if you want to.
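
The same diff can be sketched in a few lines of Python, keying each index by (table, column signature) so that the names drop out of the comparison. The data below is the example from the tables above:

```python
# Index definitions keyed by (table, column signature); the names are
# stored as values because they play no part in the diff.
spec = {("CUSTOMERS", "zipcode:state"):    "CUST1",
        ("HISTORY",   "bipad:year:issue"): "HIST1",
        ("ORDERS",    "bipad:year:issue"): "ORDERS1"}

db   = {("CUSTOMERS", "zipcode:state"):    "CUST1",
        ("HISTORY",   "bipad:year:issue"): "ABCD",
        ("ORDERS",    "year:issue"):       "ORDER1"}

to_build = [key for key in spec if key not in db]   # in the spec, not in the database
to_drop  = [key for key in db if key not in spec]   # in the database, not in the spec
```

The HISTORY index never appears in either list, even though it has different names in the spec and the database, which is exactly the behavior we want.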

The Rest of It

From here it is a simple matter to generate some commands to create the indexes we need.

Keys work the same way, with a few obvious differences in how they might be named.

We can add features from here to track if the columns are being indexed in ascending or descending order.
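
As a sketch of that generation step with direction flags included, here is a small helper of my own; the (column, ascending) pair shape mirrors the flag_asc property from the spec earlier, and the unique parameter mirrors the unique flag:

```python
def index_ddl(table, index_name, columns, unique=False):
    """Build a CREATE INDEX statement from one spec entry.

    columns is a list of (column_name, ascending) pairs, mirroring the
    flag_asc property in the dictionary spec.
    """
    col_list = ", ".join(
        name + ("" if ascending else " DESC") for name, ascending in columns
    )
    head = "CREATE UNIQUE INDEX" if unique else "CREATE INDEX"
    return f"{head} {index_name} ON {table} ({col_list})"

# The "history" index from the shipments spec: bipad ASC, year DESC, issue DESC
ddl = index_ddl("shipments", "history",
                [("bipad", True), ("year", False), ("issue", False)])
```

Here ddl comes out as "CREATE INDEX history ON shipments (bipad, year DESC, issue DESC)".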

Conclusion: Indexes Go By Definition

When writing a database upgrade "builder" program, the key thing to understand about indexes and keys is that you are looking to identify and build indexes according to their definitions; the names do not matter at all.

Sunday, January 4, 2009

Dictionary Based Database Upgrades

The number one search term that brings people to this blog is "data dictionary." So this week I will begin a series on how to use the data dictionary to improve your own productivity and reduce errors.

Building And Upgrading

This week we are going to see how to use a data dictionary to eliminate upgrade scripts (mostly) and make for more efficient upgrades. The approach described here also works for installing a system from scratch, so an install and an upgrade become the same process.

The major problems with upgrade scripts are these:

  • They are the least-tested code in any system, and are the most likely to break. If a script breaks and anybody but the original programmer is running the upgrade, this leads to aborted upgrades and upset customers.
  • They are horribly inefficient when a customer upgrades after a long time: the same table may be rebuilt many times as script after script adds a column or two.

By contrast, a dictionary-based upgrade can take any customer from any version of your software and in the fewest steps possible bring them completely current, with no possibility of broken scripts.

But first, a quick review of an important idea...

Review: The Text File

Back in June of 2008 this blog featured an overview of the data dictionary. There were many ideas in that essay, but I wish to focus on one in particular: the question of where to put the data dictionary and in what format.

In terms of where to put it, the data dictionary should be just another application file, in source control, and delivered with your code. When the dictionary is in a plaintext file (or files) and treated like other application code, you do not have to invent any new methods for handling it, just use the same methods you use for the rest of your code.

In simple practical terms, it is best if a data dictionary can be easily read and written by both people and computers. This leads to a plaintext file or files in some format such as JSON or YAML. I personally prefer YAML because it is a superset of JSON, so using YAML gives me the option to sneak in JSON syntax later if I choose, but starting with JSON does not let me go the other way.

Requiring easy handling by people tends to rule out XML, which is bletcherous to work with manually (in my humble opinion). Requiring readability by the computer rules out UML unless your UML drawing tool can produce a usable data file (comments always welcome, tell us your favorite tool for doing this!). When considering UML, it is the class diagrams that are most likely to be translatable into a data dictionary.

Finally, encoding dictionary information in program class files technically meets the practical requirements listed above, but it has the disadvantage of trapping data in code, which unnecessarily couples your dictionary to whatever language you are using at the moment. It is much better if the dictionary sits outside of the code as pure data. Not to mention that spreading the dictionary out in a collection of one-class-per-table files makes it much harder to do upgrades in the way I am about to describe.

Review of Steps

When using a dictionary-based approach, you write some type of "builder" program that reads your dictionary file, examines the current structure of the database, and then generates SQL commands to alter and create tables to make them all current.

There are plenty of ways to do this. My own approach is to load the dictionary itself into tables, pull the current state into similar tables, and then do queries to find new and altered tables and columns. If you want to see a full-blown program of this type, check out androBuild.php, the Andromeda implementation of this idea. The routines that apply to today's topic include "RealityGet()", "Differences()", "Analyze()" and "PlanMake()".

Step 1: Load the Dictionary to RAM

To use the approach in this essay, you begin by parsing your plaintext file and loading it to tables. Here is a simple example of what a dictionary file might look like in YAML format:

table states:
    description: US States
    column state:
        type: char
        colprec: 2
        caption: State Code
        primary_key: "Y"
    column description:
        type: varchar
        colprec: 25
        caption: State Name

If you are using PHP, you can parse this file using the spyc program, which converts the file into associative arrays. All or nearly all modern languages have a YAML parser; check out the YAML site to find yours.

Step 2: Load the Dictionary To Tables

The database you are building should have some tables that you can use as a scratch area during the upgrade. You may say, "The builder gives me tables, but I need tables to run the builder, how can I do this?" The simplest way is to hardcode the creation of these tables. A more mature solution would use a separate dictionary file that just defines the dictionary tables.

The structure of the tables should match the data file, of course. Here is what the YAML above would look like after being loaded to tables:

states  | US States  

states  |state       |State Code |char    |2          
states  |description |State Name |varchar |25
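
In Python terms, the parsed YAML comes back as nested dictionaries, and flattening it into the two row sets above takes only a couple of comprehensions. The nested "columns" key below is my own assumed layout for the parsed data, not a fixed format:

```python
# What a YAML parser would hand back for the "states" example above,
# assuming column definitions are nested under a "columns" key.
dictionary = {
    "states": {
        "description": "US States",
        "columns": {
            "state":       {"type": "char",    "colprec": 2,
                            "caption": "State Code", "primary_key": "Y"},
            "description": {"type": "varchar", "colprec": 25,
                            "caption": "State Name"},
        },
    },
}

# Flatten into rows shaped like the two tables shown above.
tables_spec = [(tab, det["description"]) for tab, det in dictionary.items()]
columns_spec = [
    (tab, col, c["caption"], c["type"], c["colprec"])
    for tab, det in dictionary.items()
    for col, c in det["columns"].items()
]
```

These row lists are what get inserted into the builder's scratch tables in the next step.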

Step 3: Fetch The Current State

All modern databases support the "information_schema" database schema, a schema inside of each database that contains tables that describe the structure of the database. While you can make queries directly against the information_schema tables, I prefer to fetch the information out of them into my own tables so that all column names are consistent with my own. A simple query to do this might look like this:

-- Postgres-specific example of pulling info out of the
-- information_schema table:

insert into TABLES_NOW (table_id)  -- my dictionary table
SELECT table_name as table_id 
  FROM information_schema.tables  
 WHERE table_schema = 'public'
   AND table_type = 'BASE TABLE'
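
SQLite is one exception: it has no information_schema, but its sqlite_master catalog plays the same role. A sketch of the same pull, runnable from Python (the table names here are mine, for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers  (id INT);
CREATE TABLE orders     (id INT);
CREATE TABLE tables_now (table_id TEXT);  -- the builder's scratch table
""")

# Equivalent of the information_schema pull: copy the names of the
# current user tables into the scratch table.
conn.execute("""
    INSERT INTO tables_now (table_id)
    SELECT name FROM sqlite_master
     WHERE type = 'table' AND name <> 'tables_now'
""")
current = sorted(r[0] for r in conn.execute("SELECT table_id FROM tables_now"))
```

After the pull, current holds the names of the two application tables, ready to be diffed against the spec.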

Pulling column information out can be much more complicated, owing to differences in how vendors implement information_schema, and owing to the complex way data is stored in it. Here is my own code to pull the definitions of columns out of the Postgres information_schema; it also simplifies the definitions dramatically, to make my downstream coding easier:

insert into zdd.tabflat_r
 SELECT c.table_name, c.column_name,
        CASE WHEN POSITION('timestamp' IN data_type) > 0 THEN 'timestamp'
             WHEN POSITION('character varying' IN data_type) > 0 THEN 'varchar'
             WHEN POSITION('character' IN data_type) > 0 THEN 'char'
             WHEN POSITION('integer' IN data_type) > 0 THEN 'int'
             ELSE data_type END,
        CASE WHEN POSITION('character' IN data_type) > 0 THEN character_maximum_length
             WHEN POSITION('numeric'   IN data_type) > 0 THEN numeric_precision
             ELSE 0 END,
        CASE WHEN POSITION('numeric'   IN data_type) > 0 THEN numeric_scale
             ELSE 0 END
   FROM information_schema.columns c
   JOIN information_schema.tables t ON t.table_name = c.table_name
  WHERE t.table_schema = 'public'
    AND t.table_type   = 'BASE TABLE'
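
The same simplification is easy to express as a plain function. A rough Python equivalent of the CASE expressions (the helper is my own, but it applies the tests in the same order as the query):

```python
def simplify_type(data_type, char_len=0, num_prec=0, num_scale=0):
    """Collapse an information_schema type name into (type, precision, scale),
    mirroring the CASE expressions in the query above."""
    if "timestamp" in data_type:
        return ("timestamp", 0, 0)
    if "character varying" in data_type:
        return ("varchar", char_len, 0)
    if "character" in data_type:       # plain char; tested after "character varying"
        return ("char", char_len, 0)
    if "integer" in data_type:
        return ("int", 0, 0)
    if "numeric" in data_type:
        return (data_type, num_prec, num_scale)
    return (data_type, 0, 0)
```

Note that "character varying" must be tested before "character", just as in the SQL, or every varchar would be reported as a char.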

Step 4: The Magic Diff

Now we can see how the magic happens. Imagine you have 20 tables in your application, and in the past week you have modified 5 of them and added two more. You want to upgrade your demo site, so what is the next step for the builder?

The builder must now do a "diff" between your dictionary and the actual state of the database, looking for:

  • Completely new tables.
  • New columns in existing tables.

Let's say you have two tables: "TABLES_SPEC", which lists the tables in your application, and "TABLES_NOW", which lists the tables in your database. The following query will give you a list of new tables:

SELECT spec.table_id
  FROM TABLES_SPEC spec
 WHERE NOT EXISTS (
       SELECT table_id FROM TABLES_NOW now
        WHERE now.table_id = spec.table_id
       )

It is now a simple thing to pull the column definitions for each table and generate some DDL to create the tables.

But we also have tables that have new columns. We can pull those out like so:

 SELECT spec.table_id, spec.column_id
   FROM COLUMNS_SPEC spec
  -- the first where clause gets new columns
  WHERE NOT EXISTS (
        SELECT table_id FROM COLUMNS_NOW now
         WHERE spec.table_id = now.table_id
           AND spec.column_id = now.column_id
        )
    -- this second subquery makes sure we are
    -- getting only existing tables
    AND EXISTS (
        SELECT table_id FROM TABLES_NOW now
         WHERE now.table_id = spec.table_id
        )

Now again it is a simple matter to generate DDL commands that add all of the new columns into each table. Some databases will allow multiple columns to be added in one statement, while others will require one ALTER TABLE per new column (really horrible when you have to do that).
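
Put together, the generation step can be sketched like this. The diff results here are hypothetical, and the sketch takes the portable one-ALTER-per-column route:

```python
# Hypothetical diff output: things in the spec but not in the database.
new_tables  = {"invoices": [("invoice_id", "int"), ("total", "numeric")]}
new_columns = {"customers": [("zipcode", "char(10)")]}

ddl = []
for table, cols in new_tables.items():
    body = ", ".join(f"{name} {ctype}" for name, ctype in cols)
    ddl.append(f"CREATE TABLE {table} ({body})")
for table, cols in new_columns.items():
    # Some servers accept several ADD COLUMN clauses in one statement;
    # one ALTER TABLE per column works everywhere.
    for name, ctype in cols:
        ddl.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype}")
```

The resulting list of statements is then executed in order against the database being upgraded.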

Please note this is sample code only, just to give you ideas, and it will not cover every case.

Sidebar: Avoid Destructive Actions

Do not rush into writing code that drops columns or tables that are not in your spec. The results of a misstep can be disastrous (as in lose your job or your customer). My own builder code is now 4 1/2 years old and I have never yet bothered to write a "destructive" upgrade that will clean out unused tables and columns. Maybe someday...

What I Left out: Validation

There was no space in this essay to discuss a very important topic: validating the spec changes. It may be that a programmer has done something nasty like change a column type from character to integer. Most databases will fail attempting to alter the column because they don't know how to convert the data. Your builder program can trap these events by validating the upgrade before any changes are made. This will be treated fully in a future essay.

More that I Left Out: Indexes, Keys...

There are many many other things you can and really must create during the build, beginning with primary key and foreign keys, not to mention indexes as well. These will be covered in a future essay.

More That I Left Out: When You Still Need Scripts

There are plenty of reasons why you may still need a few upgrade scripts; these will be discussed in a future essay. They all come down to moving data around when table structures change significantly.

Conclusion: One Upgrade To Rule Them All

The approach described this week for upgrading databases has many advantages. It is first and foremost the most efficient way to upgrade customers from any version directly to the latest version. It is also the simplest way to handle both installations and upgrades: they are both the same process. Putting the dictionary file into plaintext gives you complete source control just like any other application file, and overall you have a tight, efficient and error-free upgrade process.