Sunday, November 2, 2008

Keeping Denormalized Values Correct

A normalized database stores each fact in exactly one place. This makes for very robust write operations, because it is much easier to get things right on the way in. But it becomes much harder to get things out efficiently or easily, so very often we denormalize; that is, we store facts in more than one place for easier retrieval. This requires a well-thought-out strategy to make sure these repeated values are always correct.

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Review of Methods

For our example this week we will consider a shopping cart. The orders are kept in the ORDERS table and the items purchased are in the ORDER_LINES table. We have denormalized the database by keeping the value ORDER_TOTAL in the ORDERS table. Every time a line item is changed, the ORDER_TOTAL must be updated.
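
To make the example concrete, here is a minimal sketch of the two tables in PostgreSQL syntax; every column except ORDER_TOTAL is invented for illustration:

CREATE TABLE ORDERS (
    order_id     integer       PRIMARY KEY,
    customer_id  integer       NOT NULL,
    order_total  numeric(12,2) NOT NULL DEFAULT 0  -- denormalized: sum of the line amounts
);

CREATE TABLE ORDER_LINES (
    order_line_id serial        PRIMARY KEY,
    order_id      integer       NOT NULL REFERENCES ORDERS (order_id),
    sku           varchar(20)   NOT NULL,
    qty           integer       NOT NULL,
    price         numeric(12,2) NOT NULL,
    amount        numeric(12,2) NOT NULL  -- extended amount: qty * price
);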

There are a handful of methods popular in the wild for doing this. Some practices emphasize programmer discipline, while others seek to prevent actions that would lead to incorrect values. Strategies also differ in where they operate: some work at the application level, some at the database level, and still others at the architecture level.

Personally I chose triggers about four years ago, a choice I will explain at the end of this essay.

The Weakest Approach: Discipline

The simplest approach is to make programmers aware of all denormalized values and require that they remember, whenever they modify a line item, to update the ORDER_TOTAL. This actually works well enough for small programming teams, where there are only one or maybe two programmers, preferably sitting right next to each other. Mind-reading helps here as well.

Of course this approach falls apart like a rotten burlap bag as soon as the team or the program grows beyond the team's ability to keep it all straight in their heads.

But I did not bring up this example just to ridicule it. I have found that even seasoned veteran programmers (including your humble author) will fall into the trap of trying to enforce conventions at many levels in their programming. To see why this is a bad idea and should always be avoided, consider this hypothetical case.

Imagine a new programming language is introduced known as Super-G, which is fashionable and wonderful and everybody loves it. It has a well-thought-out typing system except for one odd behavior: if you code a line that concatenates a string with a date, program execution ends with no error. The compiler does not trap for this, and the run-time does not tell you why it quit. You can Google for it and find out that everybody knows about it; you just have to remember not to do that! The language's authors have no plans to fix it because, as far as they are concerned, nothing is wrong. They fully expect you to always remember never to concatenate strings and dates.

The example is meant to be absurd, but it reinforces the point that any strategy where you just have to remember is out of the running from the start. Since we would not accept this in any tool we use, we should certainly never build our own practices upon such sand, and we would certainly not count on it to keep denormalized values correct.

Limiting Access To The Database

The next simplest strategy is to restrict which agents (programs or users) can get at the database. The idea is simple: let just one program get at the database, make sure that program is correct, and force everybody to go through the application.

This will work if you can get your programs right and there is no chance that any of the check-signers will demand access except through your application. Many programmers believe this is true for them. Some of them are right, but many are not: their users would love to get access to the database but the programmer has created a situation where it is impossible.

Personally I try to avoid this approach completely, and my reasons are both philosophical and technical.

On the technical side, successful programs always expand in scope, and the demand for flexible database access always increases. Limiting access to the database means that eventually you have to recode the entire database interface. This means work for you, cost to the customer, and work for the customer in plugging into whatever interface you create. This may be doable, but the overriding fact is that databases already have an interface, and any time spent re-inventing it could better be spent on just about anything.

On the philosophical side I simply do not like any architecture where limitations are built in from the start. Call it a personal prejudice, but I much prefer to find the flexible solution where there is one (and personally I love to find it where it appears it does not exist). Overall the flexible solution always leads to more possibilities for work, more features, and just plain more fun.

Application Framework Strategies

If you are committed to maintaining the ORDER_TOTAL in application code, and you wish to avoid the "please remember to always..." blunder, then it must not be possible for new programmers or prima donna programmers to violate the requirement. This means your framework cannot allow arbitrary SQL commands, and must somehow force all write access to particular tables to route through particular objects or functions. A good ORM system should, at the very least, not only provide a mechanism for updating related tables but also prevent any access except through that mechanism.

Personally I have no use for these kinds of systems, for reasons explained in the previous section, and so I cannot really comment on them beyond describing these basic minimum requirements.

Server-Side Strategy: Stored Procedure

A few years ago I was working in Manhattan and a fellow programmer explained that at his wife's job all database write access had to go through stored procedures. The idea was to ensure that business rules were always enforced and to prevent any programmer from wittingly or unwittingly violating the rules. In the interest of full disclosure, I'll point out that I have never worked on such a system, and all of my knowledge of such systems is second-hand or third-hand.

With that being said, the obvious up-side to this method is that you avoid forcing database access through your application, making things much more robust and extendable. Further, by coding up routines that handle UPDATEs and INSERTs to ORDERS and ORDER_LINES, you make sure that the useful but troublesome ORDER_TOTAL value is always updated when it needs to be. Further still, you can tie security to the stored procedures and control who can modify orders, which is a prime feature mentioned by everybody who has ever explained such a system to me.
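
As I said, my knowledge of these systems is second-hand, but the general shape of such a routine, sketched here in PostgreSQL with invented names, would be something like this:

CREATE OR REPLACE FUNCTION add_order_line(
    p_order_id integer,
    p_sku      varchar,
    p_qty      integer,
    p_price    numeric
) RETURNS void AS $$
BEGIN
    INSERT INTO ORDER_LINES (order_id, sku, qty, price, amount)
    VALUES (p_order_id, p_sku, p_qty, p_price, p_qty * p_price);

    -- the routine, not the caller, is responsible for the denormalized value
    UPDATE ORDERS
       SET order_total = (SELECT COALESCE(SUM(amount), 0)
                            FROM ORDER_LINES
                           WHERE order_id = p_order_id)
     WHERE order_id = p_order_id;
END;
$$ LANGUAGE plpgsql;

Programs would then be granted EXECUTE on routines like this one, while direct INSERT and UPDATE privileges on the tables themselves are revoked.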

There is a significant down-side if you intend to code the stored procedures manually. My own experience is that server-side code is the most difficult to debug (please feel free to post a comment trumpeting your favorite debugger for stored procedures, I'm all ears).

I have never been tempted to use a system like this because I believe it is still exactly one level more complicated than it needs to be. What I really want is to be able to directly code an INSERT to the ORDER_LINES table from any source and know the ORDER_TOTAL field will always be correct. If that were possible, then all parties are liberated from inventing and then using any API except SQL. Now of course many of us prefer to build some layer on top of SQL (myself included), but if the architecture supports direct SQL while enforcing business rules, then all parties are free to use abstraction layers of their choosing, and nobody is forced to invent or accommodate anything they do not wish to.

Server-Side Strategy: Triggers

It is a simple technical fact that the tightest possible encapsulation of code and data occurs when you attach triggers to tables. In our example of the ORDER_TOTAL value, any INSERT, UPDATE, or DELETE to the ORDER_LINES table would update the ORDER_TOTAL in the ORDERS table. This approach gives maximum flexibility: you can directly access the database without violating rules, and any player can use an abstraction layer of their choice, or none at all.
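
To sketch the idea, here is roughly what such a trigger looks like in PostgreSQL, using the example tables from above:

CREATE OR REPLACE FUNCTION maintain_order_total() RETURNS trigger AS $$
DECLARE
    v_order_id integer;
BEGIN
    -- a DELETE has no NEW row, so take the parent key from OLD
    IF TG_OP = 'DELETE' THEN
        v_order_id := OLD.order_id;
    ELSE
        v_order_id := NEW.order_id;
    END IF;

    UPDATE ORDERS
       SET order_total = (SELECT COALESCE(SUM(amount), 0)
                            FROM ORDER_LINES
                           WHERE order_id = v_order_id)
     WHERE order_id = v_order_id;

    RETURN NULL;  -- the return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_lines_total
    AFTER INSERT OR UPDATE OR DELETE ON ORDER_LINES
    FOR EACH ROW EXECUTE PROCEDURE maintain_order_total();

(A real version would also handle an UPDATE that moves a line from one order to another, which must recompute both totals; the sketch ignores that case for brevity.)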

Since many programmers find it very tedious and error-prone to code and debug server-side routines, this approach still faces a large obstacle if you intend to code the triggers by hand. But this should not be necessary when taking this approach, because all denormalization will follow patterns. This is a theme that I tend to repeat over and over in these essays: your tables will all follow predictable patterns and your denormalizations will likewise follow patterns. Whenever you have patterns you can have automation, and in this case that means generating the triggers instead of coding them by hand.

Another concern with this approach is security. I have been stressing the inevitable need for expanded database access as your application matures, but somebody let in with full privileges could accidentally or maliciously cause huge damage, running willy-nilly wherever they want in the database. The trigger-based approach is the tightest possible way to enforce business rules, but it does nothing to address security. And if you end up granting database access based on confidence in triggers, then you are forced to enforce security inside the database as well -- but that is an essay for another day.
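
Just to hint at where that essay would go: ordinary SQL GRANTs can at least keep a directly-connected user from touching the denormalized column by hand. The role name here is invented, and this assumes the trigger function above is declared SECURITY DEFINER, since in PostgreSQL a trigger otherwise runs with the caller's privileges and would fail when it tries to update ORDERS:

CREATE ROLE order_clerk;

-- the clerk can work on line items but cannot write to ORDERS directly,
-- so ORDER_TOTAL can only change through the trigger
GRANT SELECT ON ORDERS TO order_clerk;
GRANT SELECT, INSERT, UPDATE, DELETE ON ORDER_LINES TO order_clerk;
GRANT USAGE ON SEQUENCE order_lines_order_line_id_seq TO order_clerk;  -- for the serial key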

Conclusion

Once we decide to denormalize, we are required to dream up a strategy to keep things correct going in. The weakest strategies depend upon voluntary adherence to some set of conventions, and many strategies accept limitations in overall flexibility to reduce the threat from unknown elements. The trigger option, not very popular these days, provides the tightest encapsulation of code and data, and lends itself well to code generation.

Related Essays


The normalization essays on this blog are: