Sunday, January 27, 2008

Table Design Pattern: Limited Transaction

Welcome to the Database Programmer. This is a blog for anybody who wants to learn about databases. The entries are meant to be simple and easy to read but definitely not dumbed down. Most of us these days work on web sites of one sort or another, and since all non-trivial websites require a database, there is very good reason to learn how databases really work.

There is a new entry every Monday morning, and the complete table of contents is here. We are currently looking at Table Design Patterns and how they lead to tight and efficient code.

The Example: A School Class Schedule

In this series we are using the example of an application that manages a school of a few dozen faculty and a few hundred or perhaps a couple thousand students. Each year the school administration must make up the actual class assignments for each teacher, including what classroom and period each class will be taught in. Then students must be assigned into the classes.

There are five rules that must be followed:

  1. A teacher may not teach a class he/she is not qualified to teach.
  2. No two classes can be given in the same classroom in the same period.
  3. No teacher can be in two places at once; a teacher can teach only one class in any given period.
  4. No student can be in two places at once, so no student can be in more than one class in the same period.
  5. A student cannot take the same class again after passing the class once.

Last week's entry showed that rule 1 was an example of the Cross Reference Validation Pattern, and this week we are going to see that rules 2 and 3 are an example of the Limited Transaction pattern. Next week we will look at rules 4 and 5, which deal with the student.

Sidebar: Discovered Requirements

This week the rules are interesting because they are the kind that nobody would ever actually tell you. I call such rules discovered requirements because they are usually discovered by a programmer or database designer while the table design or programming is under way.

These rules will not be in the specification because they are so obvious that the customer would not think to write them down. It is not that the customer considers putting them in and decides not to, the rules simply never come into the customer's mind. They do not tell you these requirements for the same reason they do not tell you that fish live in water and people breathe air.

Nevertheless, if you do not seek out these rules and put them into the application, then you will get a call or an email that something is wrong because a teacher has been put into two rooms at the same time. Make no mistake, the blame always falls on the programmer, because, after all, what idiot would write a program that would let a teacher be in two places at once? It is very very hard to look a customer in the eye and say, "Well somebody really should have told me about that requirement."

Later in this series we will see more about this in an entry I am planning called "The Requirements Will Never Be Correct", but for now we will move on to the actual database design.

Looking At The Table

The table is fairly easy to work out:

   Rule 2: Unique (period + classroom)           Primary Key
          |                                           |
    +-----+------+                                    |
    |            |                                    |
  CLASSROOM  | PERIOD  | COURSE   | TEACHER   | ASSIGN_ID
  -----------+---------+----------+-----------+----------
  ...        | ...     | ...      | ...       | ...
  ...        | ...     | ...      | ...       | ...
                 |                     |
                 +----------+----------+
                            |
             Rule 3: Unique (period + teacher):
             no teacher may be in two places at once

The example above is a simple case of multiple unique constraints on a single table. The term "unique constraint" means that one or more columns must have unique values, just like a primary key. These are sometimes called "candidate keys" as well.

Identifying The Pattern: Limited Transactions

The Limited Transaction Pattern occurs when there are limitations on what transactions are allowed. To see what I mean by a limitation, we will look at a counter-example: a transaction table that has no limitations. A shopping cart is a good example. We would never tell a customer that they may have only one order per day, or that a salesperson may enter only one order per day, or anything else along those lines. But the school example is the opposite: there are several limitations on what kinds of transactions are allowed. Right now we are looking at the limitations that can be addressed with unique constraints.


Here is the SQL that will create the table as it is depicted above:

CREATE TABLE schedule (
   classroom char(5)
  ,period    char(5)
  ,course    char(10)
  ,teacher   char(10)
  ,assign_id int IDENTITY
  -- First define the primary key
  ,primary key (assign_id)
  -- Rules 2 and 3 use additional unique constraints:
  ,constraint rooms_xp    unique (period,classroom)
  ,constraint teachers_xp unique (period,teacher)
  -- Every column in this table is actually a foreign key!
  ,foreign key  (classroom) references classrooms (classroom)
  ,foreign key  (period)    references periods    (period)
  ,foreign key  (course)    references courses    (course)
  ,foreign key  (teacher)   references teachers   (teacher)
);
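Since SQLite ships with Python, we can sanity-check the two unique constraints directly. This is only a sketch: SQLite's syntax differs slightly from the dialect above (INTEGER PRIMARY KEY AUTOINCREMENT instead of IDENTITY), and the foreign keys are left out because the parent tables are not the point here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schedule table, minus the foreign keys, in SQLite syntax
conn.execute("""
    CREATE TABLE schedule (
       assign_id INTEGER PRIMARY KEY AUTOINCREMENT
      ,classroom TEXT
      ,period    TEXT
      ,course    TEXT
      ,teacher   TEXT
      ,CONSTRAINT rooms_xp    UNIQUE (period, classroom)
      ,CONSTRAINT teachers_xp UNIQUE (period, teacher)
    )""")

conn.execute("INSERT INTO schedule (classroom,period,course,teacher) "
             "VALUES ('R101','P1','PHYSICS','SRUSSEL')")

# Rule 2: a second class in room R101 during period P1 is rejected
try:
    conn.execute("INSERT INTO schedule (classroom,period,course,teacher) "
                 "VALUES ('R101','P1','HISTORY','ACLAYBORNE')")
except sqlite3.IntegrityError as e:
    print("Rule 2:", e)

# Rule 3: SRUSSEL teaching a second class in period P1 is rejected
try:
    conn.execute("INSERT INTO schedule (classroom,period,course,teacher) "
                 "VALUES ('R202','P1','CALCULUS','SRUSSEL')")
except sqlite3.IntegrityError as e:
    print("Rule 3:", e)
```

Only the first INSERT survives; the database itself refuses the other two, with no application code involved.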

Final Recap Of The Pattern

So let's review one final time how we ended up with this table.

  1. We have a transaction table, because we have an interaction between master elements (teachers, courses, classrooms and periods), so we have several foreign keys.
  2. We used an integer primary key as per Rule of Thumb 4 for Transaction Tables.
  3. A classroom is limited to only one course per period, so we add a unique constraint to enforce that.
  4. A teacher is limited to only one course per period, so we add a unique constraint to enforce that.

The pattern we have is the result of a limited transaction table, a table listing transactions in which not every conceivable combination is actually allowed. In this case the limits come from the physical reality that you cannot be in two places at once.

Conclusion: Patterns Reduce Application Code

When I first began working with databases, I had no idea how many problems could be resolved into simple unique constraints and foreign keys. A big step that any code grinder takes towards becoming a real database programmer is realizing how many seemingly complicated and difficult tasks actually resolve down to unique constraints and foreign keys.

Every time you can put a constraint into the database, you reduce the complexity of your application code. If your framework can trap server errors and report them, then you have a very simple way to enforce a lot of the business rules of your application.
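For instance, a framework's save routine might catch the server error and translate it into plain language for the user. The sketch below uses Python and SQLite; the save_row helper and its messages are hypothetical, not taken from any particular framework:

```python
import sqlite3

def save_row(conn, sql, params):
    """Hypothetical framework helper: run an INSERT and turn any
    constraint violation into a message fit for the end user."""
    try:
        with conn:
            conn.execute(sql, params)
        return None  # success: nothing to report
    except sqlite3.IntegrityError as e:
        # SQLite names the offending columns in the error text
        if "classroom" in str(e):
            return "That classroom is already booked for that period."
        if "teacher" in str(e):
            return "That teacher already has a class in that period."
        return "This entry conflicts with an existing entry."

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE schedule (
    classroom TEXT, period TEXT, course TEXT, teacher TEXT,
    UNIQUE (period, classroom), UNIQUE (period, teacher))""")

ins = "INSERT INTO schedule VALUES (?,?,?,?)"
err1 = save_row(conn, ins, ("R101", "P1", "PHYSICS", "SRUSSEL"))
err2 = save_row(conn, ins, ("R101", "P1", "HISTORY", "ACLAYBORNE"))
print(err1, err2)
```

The business rule lives in the table; the framework's only job is to translate the server's complaint.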

Next week we are going to look at rules 4 and 5, which relate to rules about the student's enrollment in certain courses.

Next Essay: False Patterns and The Reverse Foreign Key

Sunday, January 20, 2008

Table Design Patterns: Cross-Reference Validation

Welcome to the Database Programmer. This is a blog for anybody who wants to learn about databases. I try to make it simple and easy to read without dumbing it down. Most programmers these days work on web sites of one sort or another, and since all non-trivial websites require a database, there is very good reason to learn how they work.

There is a new entry every Monday morning, and the complete table of contents is here. The overall theme right now is that good table design leads to tight and efficient code. Last week's entry gave a collection of guiding rules for crafting your primary keys, and now we can turn to the concept of Design Patterns in Table Design. Table Design Patterns are recurring patterns in tables and the relationships between them. If you learn what they are, you can learn to recognize these patterns in the requirements that users give you.

Most database programmers are comfortable with the idea of Table Design Patterns, and have their own rules they have worked out through experience. The patterns I present here are well known amongst database programmers; my only contribution is to explain them and, over time, to put them all in one place.

The Example: A School Class Schedule

In this series we are using the example of an application that manages a school of a few dozen faculty and a few hundred or perhaps a couple thousand students. Each year the school administration must make up the actual class assignments for each teacher, including what classroom and period each class will be taught in. Then students must be assigned into the classes.

There are five rules that must be followed:

  1. A teacher must be qualified in advance for each course they teach.
  2. No two classes can be given in the same classroom in the same period.
  3. No teacher can be in two places at once; a teacher can teach only one class in any given period.
  4. No student can be in two places at once, so no student can be in more than one class in the same period.
  5. A student cannot take the same class again after passing the class once.

All five rules can be handled with some smart table design and we will not need any application code.

Next week we will look at rules 2 - 4, and the week after that we will tackle rule 5. This week we look at Rule 1.

Rule One and The Cross Reference Validation Pattern

When a user gives us their requirements, they will not express them as computer programs or table designs. The user will express his needs in his own terms, expecting us to translate those terms into tables and programs. So our job is to translate Rule 1 into one or more tables with proper primary and foreign keys.

When we look at rule one we see that it splits into two requirements:

  • Subrule A: There must exist a list of the courses each teacher is qualified to teach.
  • Subrule B: All actual course assignments must match the list of allowed (or qualified) courses.

When I see these two requirements together I automatically think "validate against a cross reference." I call this the "Cross Reference Validation" pattern. It is different from a simple foreign key validation. Let's go through it and see why.

We can start with subrule B, because it is easy to recognize. Any time some entry in a table must match an entry in another table, that means a foreign key.

Now we want to work out what the parent table looks like. Is it a simple master table with a one-column character key? Is it a transaction table? We answer this by drilling deeper into Subrule A:

  • Assumption 1: There is a list of teachers
  • Assumption 2: There is a list of courses
  • Assumption 3: There will be a list of teachers and courses together.

So the parent table must be a cross reference because each entry will list a teacher and a course. It should go without saying that we will have a table of teachers and another table of courses, so the table of qualifications must be a cross reference between these two.

Here is a picture of the Cross Reference Validation pattern:

  CLASSROOM | PERIOD | COURSE | TEACHER | STUDENT | ASSIGN_ID
  ----------+--------+--------+---------+---------+----------
  ...       | ...    | ...    | ...     | ...     |    1
  ...       | ...    | ...    | ...     | ...     |    2
  ...       | ...    | ...    | ...     | ...     |    3
  ...       | ...    | ...    | ...     | ...     |    4
                         |         |
                         |         |
      Use a foreign      |         |
      key to make sure   |         |
      the teacher is     |         |
      qualified for      |         |
      each course        |         |
                      COURSE   | TEACHER
                      ...      | ...
                      ...      | ...
Here is the SQL to create these tables. I have left out anything not directly related to our Cross Reference Validation pattern:

-- Create the bottom table: allowed courses by teacher
CREATE TABLE courses_x_teachers (
   teacher char(10)
  ,course  char(10)
  ,foreign key (teacher) references teachers(teacher)
  ,foreign key (course)  references courses(course)
  ,primary key (teacher,course)
);

-- Create the main table (assumes we have already created
-- the TEACHERS table, the CLASSROOMS table and so forth)
CREATE TABLE enrollment (
   classroom char(5)
  ,period    char(5)
  ,course    char(10)
  ,teacher   char(10)
  ,student   int
  -- this is a trx table, so use an integer ID for pk
  ,assign_id int IDENTITY
  ,primary key (assign_id)
  -- The cross reference validation pattern needs a foreign key
  ,foreign key                    (teacher,course)
   references courses_x_teachers  (teacher,course)
);
Another Comment On Integer Keys

In last week's essay we saw that the rule of thumb for cross-references is to use a multi-column primary key and not to use an integer key. It should be obvious now why we do that. If we used an integer key, then it could not help us!

The tables as described above completely satisfy the business requirement: no entry is allowed if the teacher is not approved for that course. But an integer key provides no such value; it is useless here. If you use an integer key for the cross-reference table, then you are required to write extra program code to "chase" the key out to the cross-reference and make sure the values for teacher and course match what you have in the child table.

Foreign Keys and Performance

One time I was giving a presentation and somebody in the audience said they had been told not to use foreign keys because they reduced performance. To me this was like saying don't put gas in your car because it weighs down the car and reduces mileage. The question was such a shock and so unexpected that I had no decent answer.

Since that time I have discovered that this kind of nonsense is actually prevalent these days. Such advice may make sense in cases where the data structure is trivial (at most 7-10 tables) and the presentation is paramount, but in any serious application such advice is ridiculous. Here is the real scoop.

Somehow, some way, your code must ensure that every class assignment made in the above example is valid. This means that the check for teacher-course validity must be made. If you do not perform this check then your software is structurally unsound, you will have bad data with no way to stop it, and you will always be patching live systems in crisis mode.

So because you must make the checks, we can ask, which is better, to make the checks in program code or using the declarative constraints in the table definition? This is easy to answer. Without the foreign key, our program must make one round trip to the server to make the check, and then a second round trip to do the insert. But if you use foreign keys you only have to make one round trip. Since the overhead of making a trip to the server is often more than the time spent executing on the server, the foreign key solution will often perform twice as well as the application code solution.
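The two approaches can be sketched side by side. Using Python and SQLite for illustration (the function names are my own), the application-code version needs two statements where the foreign-key version needs one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE courses_x_teachers "
             "(teacher TEXT, course TEXT, PRIMARY KEY (teacher, course))")
conn.execute("CREATE TABLE enrollment "
             "(teacher TEXT, course TEXT, FOREIGN KEY (teacher, course) "
             " REFERENCES courses_x_teachers (teacher, course))")
conn.execute("INSERT INTO courses_x_teachers VALUES ('SRUSSEL', 'PHYSICS')")

def assign_checked_in_code(teacher, course):
    """Application-code version: one round trip to check, a second to insert."""
    row = conn.execute(
        "SELECT 1 FROM courses_x_teachers WHERE teacher=? AND course=?",
        (teacher, course)).fetchone()                  # round trip 1
    if row is None:
        return "teacher not qualified"
    conn.execute("INSERT INTO enrollment VALUES (?,?)",
                 (teacher, course))                    # round trip 2
    return None

def assign_with_fk(teacher, course):
    """Declarative version: a single round trip; the key does the check."""
    try:
        conn.execute("INSERT INTO enrollment VALUES (?,?)",
                     (teacher, course))                # the only round trip
        return None
    except sqlite3.IntegrityError:
        return "teacher not qualified"

print(assign_with_fk("SRUSSEL", "PHYSICS"))   # accepted
print(assign_with_fk("SRUSSEL", "HISTORY"))   # rejected by the foreign key
```

Both versions enforce the same rule; the foreign-key version simply lets the server do it in half the trips.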

In short, if you're worried about performance, use foreign keys, not application code.

Architecture Note: Server-Side Errors

Many programmers are taught not to use foreign keys and so they are not used to the idea that the database server will throw errors. Once you start using the database for all that it is worth, you will depend more and more on the errors it throws, so you want to make sure your framework can read them and report them to the user the same way it reports your application-generated errors.

Conclusion: Learn to Recognize Foreign Keys

User requirements will never be expressed as program code or table design, but we can recognize common patterns in them. One of those patterns is the Cross Reference Validation pattern, which we implement with a foreign key into a cross reference table. This and other patterns will stand out if we examine user requirements with an aim to identifying:

  • Lists of things you are keeping track of. These go into master tables, like courses and teachers.
  • Relationships between those master items, like a list of teacher-course qualifications.
  • Restrictions on how things can interact. The rule that a teacher must be qualified to teach a course means there will be a foreign key from some other table into that teacher-course cross reference.

Next week we will examine Rules 2-4 and find out more about how unique constraints and their associated patterns can reduce the amount of code in our applications.

Next Essay: Limited Transaction Pattern

Table Design Patterns

This entry lists all of the Table Design Patterns that I have described in the blog entries. I will update it whenever a new pattern is described.

Basic Table Types

These patterns describe the kinds of things that you store in tables. Each pattern is characterized by the relative number of columns and rows, and whether it stores either information about permanent things or interactions between permanent things.

These patterns were described in the entry on A Sane Approach to Primary Keys.

Pattern Name    | Relative Columns | Relative Rows | Type      | Notes
----------------+------------------+---------------+-----------+---------------------------------
Reference       | Small            | Small         | Permanent | Use single-column character
                |                  |               |           | primary key.
Small Master    | Small            | Small         | Permanent | Use single-column character
                |                  |               |           | primary key.
Large Master    | Large            | Large         | Permanent | Use integer auto-assigned
                |                  |               |           | primary key.
Transactions    | n/a              | n/a           | Transient | Describes interactions between
                |                  |               |           | things, like a customer purchase
                |                  |               |           | of an item or a student's
                |                  |               |           | enrollment in a class. Use
                |                  |               |           | integer auto-assigned primary key.
Cross Reference | n/a              | n/a           | Permanent | Describes relationships between
                |                  |               |           | master entries, such as an item's
                |                  |               |           | price group or a teacher's
                |                  |               |           | department. Use multi-column
                |                  |               |           | primary keys.

Expanded Table Types

The Limited Transaction Pattern occurs when restrictions on allowed transactions require one or more additional unique constraints on a transaction table.

The Impermanent Primary Key pattern occurs when a value that is a good choice for a natural key will change from time to time. For this pattern we use a pair of tables to track the entity.

Foreign Key Patterns

There are two fundamental kinds of foreign key, which correspond to the "master table" and "transaction table" types.

The cross-reference validation pattern occurs when an entry must be validated against some previously defined relationship between master items.

Secure Patterns

Some table patterns depend upon security as a basic part of their definition. Different combinations of SELECT, INSERT, UPDATE, and DELETE permissions can replace complex application logic with zero-code server-implemented solutions.

Denormalization Patterns

Many seasoned database programmers denormalize their databases for a variety of reasons. Like all database activities, these also follow patterns. In the post Denormalization Patterns, we see three distinct patterns.

Other Patterns

The Resolution Pattern occurs when a value may come from more than one place and you must resolve the possibilities into a final choice.

History Tables provide three major benefits. They provide an audit trail of user actions, they give you the ability to reproduce the state of a table at some prior time, and if they are cleverly designed they can produce very useful aggregate numbers such as a company's total open orders for any given day in the past or the total change in open balances in any arbitrary period of time.

If you need to Sequence Dependencies it can be done with a combination of tables and server-side code.

You can implement Secure Password Resets entirely in the database server.


Sometimes user requirements appear to call for things that are impossible to do. When the analysis leads to one of these patterns it may seem like a dead-end, but there are usually valid patterns hiding beneath these.

  • When user requirements say "If X happens then Y may not happen" some analysts will see this as saying an entry in table A prohibits an entry in table B. This is a Reverse Foreign Key, which does not exist and cannot be implemented; it is an anti-pattern. These are often A Primary Key in Disguise.

Monday, January 14, 2008

Database Skills: A Sane Approach To Choosing Primary Keys

Welcome to the Database Programmer. This blog is for anyone who wants to learn about databases, both simple and advanced. Since all non-trivial websites require a database, and since database skills are different from coding skills, there is very good reason for any programmer to master the principles of database design and use.

Every Monday morning there is a new entry. The complete table of contents for the Monday morning series is here.

This week's entry is rather long. I strongly considered splitting it into two weeks, but decided to keep it as one so that it would be an easier source of reference in the future.

There is no One True Primary Key

There are several competing theories out there on how to choose primary keys. Most of them tell you to use a single kind of key for all tables, usually an integer. In contrast to those theories I have found that a robust application uses different kinds of keys for different kinds of tables. In the last 15 years I have worked on projects large and small, simple and complex. Sometimes I had total technical control, and sometimes I had to work with what others gave me, and sometimes it was a little of both. Today's essay reflects what I have worked out in those years, and how I build my tables today. My goal is to report what actually works, not to promote a particular theory about how everybody should do something.

This week we will see "rules of thumb". A rule of thumb is a guiding idea that will tend to hold true most of the time, but which you may decide to change in certain circumstances.

Rule of Thumb 1: Use Character Keys For Reference Tables

A reference table is one that tends to be strongly constant over time and has relatively few columns. Sometimes a reference table may come already populated by the programmer. Examples include tables of country codes (perhaps with international telephone prefixes), a table of provinces or states within a country, or a table of timezones. In this series I have been using the example of a school management program, for that program we might give the user a reference table of school subjects like history, math, physics and so forth.

For these tables it is best to make a character primary key, which we often call a "code", as in "timezone code" or "country code" or "subject code." The strategy is to make a code which can be used on its own as a meaningful value that people can understand. This gives us tables that are easier to use for both programmer and end-user.

Let's consider our school management program. We have a table of teachers (populated by the school staff), and a table of subjects which we have provided as a reference table. When a teacher joins the faculty, somebody must enter the subjects that that teacher is qualified to teach. The tables below show two examples of what this table might look like. Which is easier to read?



Teacher | Subject                 Teacher     |  Subject
--------+----------               ------------+-----------
72      | 28                      SRUSSEL     |  PHYSICS
72      | 32                      SRUSSEL     |  CALCULUS
72      | 72                      SRUSSEL     |  HISTORY
45      | 28                      ACLAYBORNE  |  PHYSICS
45      | 29                      ACLAYBORNE  |  CELLBIOLOGY
45      | 45                      ACLAYBORNE  |  RUSSIAN

The table of character keys is much easier to work with, for the simple reason that many times you can just use the codes themselves, so you can avoid a lot of JOINs to the main tables. With integers you must always JOIN to the master table so you can get a meaningful value to show the user. But not only is the table itself easier to read when you are debugging, it is easier to work with when writing queries:

-- The character key example is pretty simple:
SELECT teacher,subject FROM teachers_x_subjects

-- The integer key absolutely requires joins
SELECT x.teacher_id,x.subject_id,s.description
  FROM teachers_x_subjects x
  JOIN teachers t ON x.teacher_id = t.teacher_id
  JOIN subjects s ON x.subject_id = s.subject_id

I often hear people say they do not like SQL because it is so complicated and they hate doing so many JOINs. It makes me wonder if the person is lost in a JOIN jungle caused by very bad advice about always using integer primary keys.

If you are using some kind of ORM system that tries to protect you from coding any SQL, that basic problem of over-complicated tables will still appear in your code. One way or another you must enter details that tell the ORM system how to get the descriptions, which would not be necessary if the keys were meaningful character values.

We can now see the surprising fact that integer keys will actually slow us down in many situations. Not only do they have no performance advantage, they actively hurt performance, because they require joins on almost every query. A 3-table query with two joins will always be slower than a 1-table query with no joins. If you are using an ORM system that does not do JOINs but instead does separate fetches, then you have 3 round trips to the server instead of 1, and if you have queries in a nested loop, performance will simply crash and burn. All of this is rather ironic, since you so often hear people blindly repeat the dogmatic phrase "We will use integer keys for performance reasons..."

Rule of Thumb 2: Use Character Keys for Small Master Tables

Many database programmers use the term "master table" to mean any table that lists the properties of things that have some permanence, like customers, teachers, students, school subjects, countries, timezones, items (skus), and anything else that can be listed once and used many times in other places. Generally a master table has more columns than a simple reference table.

Some master tables are small and do not change often. In our ongoing example of a school management application, the list of teachers is a good example of a small master table. Compared to the list of students, which is much larger and changes every year, the table of teachers at most schools (except for huge state universities) will have only a few changes each year.

For tables like this it is good to allow the user to enter character keys if they want to. Some schools will insist on being allowed to choose their own codes like 'SRUSSEL' for "Saxifrage Russel", while others will say, "Why should I have to make up a code, can't the computer do that?"

For these tables I have found it useful to always define the primary key as a character column, and then to allow some flexibility in how it is generated. Common ways of generating codes include:

  1. Letting the user make up their own code
  2. Generating a code out of some other column or columns, like first letter of first name, plus 5 letters of last name, plus three numeric digits. (This used to be very popular in decades past).
  3. Generate a number.

The key idea here is to follow the needs of your users. Option #2 above is one of the most useful because it gives you the best of both worlds.
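Option #2 above can be sketched in a few lines. This is an illustrative Python version; the 3-digit collision-breaking suffix is my assumption about how you might handle two teachers with similar names:

```python
def make_teacher_code(first, last, existing):
    """Option 2 sketch: first letter of first name, up to five letters
    of last name, then a 3-digit suffix to break any collisions."""
    base = (first[:1] + last[:5]).upper()
    for n in range(1000):
        code = f"{base}{n:03d}"
        if code not in existing:
            return code
    raise ValueError("no free code for this name")

# First teacher with this name gets suffix 000, the next one 001, etc.
taken = set()
code = make_teacher_code("Saxifrage", "Russel", taken)
taken.add(code)
print(code, make_teacher_code("Saxifrage", "Russel", taken))
```

In practice the "existing" set would be a quick lookup against the table itself, and a user who prefers to type 'SRUSSEL' by hand can simply override the generated value.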

Rule of Thumb 3: Use Integers For Large Master Tables

Some master tables are large or they change often, or both. In our ongoing example of a school management application, the list of students will change every year, with many students coming and going. Another example is a doctor's office that has patients coming and going all of the time. I have found it best to use plain integer keys here because:

  • Unlike small master tables (like teachers) or reference tables (like school subjects), a code is not likely to have any meaning for the end-user, so the biggest argument for using it does not hold.
  • Unlike reference tables, the master table is likely to have many more columns and you will probably end up JOINing to the table many times. This means our other big reason for using codes, which is to avoid JOINs, does not hold either.
  • It is not realistic to expect end-users to be making up codes for large tables, and since the codes will have no value, why should the end-user be troubled with the job?
  • Writing algorithms to generate unique codes will run into more difficulties, and since the code has no value why bother?

Rule of Thumb 4: Use Integers For Transaction Tables

Many database programmers use the term "transaction table" to mean any kind of table that records some kind of interaction or event between master tables. In an eCommerce program the shopping cart tables are all transaction tables, they record the purchase of items by customers. In our school management program the actual classes taken by students are transactions, because they record specific interactions between students and teachers.

For these tables the auto-generated integer key tends to be the most useful. I am not going to present any arguments for this because most programmers find it self-evident. It should be enough to say that any attempt to use a compound key (like customer + date) always ends up causing a problem by limiting what can be entered, so the meaningless integer key is the way to go.

Rule of Thumb 5: Use Multi-Column Keys In Cross References

A useful database will end up with a lot of cross reference tables in it. A cross-reference table is any table that lists various facts about how master tables relate to each other. These tables are extremely useful for validating transactions. In fact, next week's entry will be all about these tables and how to use them.

For now the important point is that the primary key of a cross-reference is a combination of the foreign keys. We do not make up an extra column, either integer or character.


Teacher     |  Subject
------------+-------------
SRUSSEL     |  PHYSICS
SRUSSEL     |  CALCULUS
ACLAYBORNE  |  PHYSICS
The SQL for this table would resemble something like this:

CREATE TABLE teachers_x_subjects (
    teacher char(10)
   ,subject char(10)
   ,primary key (teacher,subject)
   ,foreign key (teacher) references teachers(teacher)
   ,foreign key (subject) references subjects(subject)
);

The reasons for this are rather complex, and next week the entire entry will be devoted to this and similar ideas. For now we will note that this approach lets us validate teacher-class assignments so that no teacher is assigned to teach a class she is not qualified for. Using a new column as a primary key does not allow that, and therefore leads to more complicated and error-prone code.

Rule of Thumb 6: Use Given Keys For Non-Insert Imports

Many systems today that we create will interact with systems that already exist. A typical eCommerce program will get a list of items and maybe even customers from the company's main computer system.

For some of these tables, your own system will absolutely never make new rows. A very common example is a table of items on an eCommerce site that is loaded up from some other computer system.

For these tables, the simplest route is to use whatever key exists on the table as it is given to you. Any other route involves more work with no clear motivation for putting out the effort.

Rule of Thumb 7: Use Integer Keys for Import/Export Tables

Sometimes you may have a table whose original values come from another system, but unlike the previous case your own system is generating new rows for the table, and you may have to send these rows back to the original system.

One classic example of this is a list of customers. I created a website a few years ago where the list of customers is updated from a different system from time to time. However, new customers can also sign up online. Both systems are handing the customer list back and forth periodically to keep them reconciled.

In these cases I have an integer primary key for the table because it follows Rule of Thumb 3, it is a large master table. The most important concept here is that you must not try to combine your key and the key from the original table. Keep the key from the original table in its own column, index on it, and use it for updates, but do not try to enforce it as a unique column. The other system must take care of its own key, and your system must take care of yours.
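As a sketch of this rule in SQLite (the table and column names are invented for illustration), the other system's key gets its own indexed but non-unique column alongside our integer primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Our own integer key, plus the other system's key in its own column.
# We index the external key for fast matching but do NOT make it
# unique: the other system takes care of its own key, we take care of ours.
conn.execute("""CREATE TABLE customers (
    customer_id  INTEGER PRIMARY KEY,
    external_key TEXT,
    name         TEXT)""")
conn.execute("CREATE INDEX customers_ext ON customers (external_key)")

# A customer loaded from the other system is later updated by its key...
conn.execute("INSERT INTO customers (external_key, name) "
             "VALUES ('A-100', 'Old Name')")
conn.execute("UPDATE customers SET name=? WHERE external_key=?",
             ("New Name", "A-100"))

# ...while a customer who signed up online has no external key yet
conn.execute("INSERT INTO customers (name) VALUES ('Web Signup')")
```

When the reconciliation runs, rows without an external key are the ones to send back to the other system.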

Rule of Thumb 8: Use An Object Id On All Tables

Back when people were getting excited about the concept of "Object-Relational Databases", they came up with the term "object id" to denote a column that contains some unique value but otherwise has no meaning. The same idea exists with different names, but Object ID is now the term that most people understand so that is the term I will use.

Your programs can be made simpler in many cases if you add an object id to every single table in addition to the primary key. An object id is useful specifically for user interface code. If you use an object id, then it is easier to write UPDATE and DELETE statements, and it is easier to write framework or ORM code that does these things for you.

If you are following these rules of thumb closely in your project then it is important not to use the object id as a primary key, and therefore you may never use it as a foreign key either. If you use an object id as the primary key then you lose a lot of the benefits of the character keys listed above.

Also, if you follow these rules in your projects, your transaction tables will have both an auto-generated primary key like CART_ID and an auto-generated object id. Some programmers are bothered by this because they don't like the idea that two columns appear to be doing the same thing, and they try to save a column. Personally this does not bother me, because it helps me write robust applications, and this is not 1985, when a 10MB hard drive cost hundreds of dollars.
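A rough sketch of the idea (the table, column, and function names here are illustrative only): framework code can update any row by object id, without knowing the table's real primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical transaction table: the real primary key plus an object id.
conn.execute("""
    CREATE TABLE carts (
        cart_id   INTEGER PRIMARY KEY,  -- used in foreign keys
        object_id INTEGER UNIQUE,       -- used only by UI/framework code
        status    TEXT
    )
""")
conn.execute("INSERT INTO carts (cart_id, object_id, status) VALUES (1, 1001, 'open')")

# Generic framework code can save to any table the same way, because every
# table has an object_id column with the same name and meaning.
def save_row(conn, table, object_id, column, value):
    conn.execute(f"UPDATE {table} SET {column} = ? WHERE object_id = ?",
                 (value, object_id))

save_row(conn, "carts", 1001, "status", "checked_out")
```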

Absolute Rule 1: Only Atomic Values

This is not merely a "rule of thumb" but a rule that I follow absolutely. It is actually part of First Normal Form, which is that column values must be atomic, or indivisible. Another way to say it is that the column must not have "subvalues" buried in it.

I have included this rule here instead of with First Normal Form because when most programmers violate this rule they are making primary keys by combining different values together. In our example of a school program, if we have a list of the actual students taking classes in a given school year, you might have a squashed-up primary key column like this:

CLASS_CODE           | STUDENT
---------------------+-----------
2008-HIST101-RUSSELL | NIRGALAI
2008-HIST101-RUSSELL | JBOONE

There are two practical problems with doing this:

  1. You cannot use a foreign key to validate the sub-values, so you must code validation manually.
  2. Retrieving the sub-values requires extra code, either in the SELECT or in your client code. If the values were in separate columns this would not be necessary.
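To see the first point concretely, here is a sketch in SQLite with the sub-values kept in separate columns, where foreign keys do the validation for us (the table names are my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection

conn.execute("CREATE TABLE classes (class_code TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE students (student TEXT PRIMARY KEY)")

# With separate columns, each sub-value is validated by its own foreign key.
conn.execute("""
    CREATE TABLE enrollments (
        class_code TEXT REFERENCES classes (class_code),
        student    TEXT REFERENCES students (student),
        PRIMARY KEY (class_code, student)
    )
""")
conn.execute("INSERT INTO classes VALUES ('HIST-101')")
conn.execute("INSERT INTO students VALUES ('JBOONE')")
conn.execute("INSERT INTO enrollments VALUES ('HIST-101', 'JBOONE')")

# A bad class code is rejected by the database itself,
# with no hand-written validation code anywhere.
try:
    conn.execute("INSERT INTO enrollments VALUES ('NO-SUCH', 'JBOONE')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

With a squashed-up key column, neither foreign key would be possible, and this validation would have to be coded by hand.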

Absolute Rule 2: No Magic Values

Another rule that I follow absolutely is to never have magic values. A magic value is a value in a column that causes some non-obvious result. I have included this rule in this essay because most programmers who break it do so by hard-coding special actions to occur based on values of keys in reference tables and master tables.

An example might be a table of teachers, where one of the teacher values is something like "SUBSTITUTE", and the program is hardcoded to do a lot of different things when it sees this value. Magic values are bad because the code is harder to debug. It may not be obvious to a programmer that some special value of the TEACHER column would cause special actions to occur. But if you have a column called FLAG_SUBSTITUTE then any programmer who must maintain code written by somebody else will have a much easier time of it.

Magic values also confuse end-users. It may seem obvious to us that the value "SUBSTITUTE" in the teacher column means substitute, but if this value causes other things to occur, and we are in the regular habit of putting these values in lots of tables, then the compound effect can be lots and lots of phone calls from confused users, and big trouble for the software developer's bottom line.

Finally, magic values limit you. If you use the value "SUBSTITUTE" as a single teacher in the teachers file, then how do you keep track of the dozen-odd substitutes the school may hire in a year? The end-user is stuck: they must use pen and paper. It is much better to let them enter each substitute as a regular faculty member, with a FLAG_SUBSTITUTE column to check off.
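A sketch of the flag-column design (the names and sample data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Substitutes are ordinary rows with a flag, not a magic value of TEACHER.
conn.execute("""
    CREATE TABLE teachers (
        teacher         TEXT PRIMARY KEY,
        flag_substitute INTEGER NOT NULL DEFAULT 0
    )
""")
conn.executemany("INSERT INTO teachers VALUES (?, ?)",
                 [("RUSSELL", 0), ("SMITH", 1), ("JONES", 1)])

# The school can track any number of substitutes and query them plainly.
subs = conn.execute(
    "SELECT teacher FROM teachers WHERE flag_substitute = 1 ORDER BY teacher"
).fetchall()
print([s[0] for s in subs])  # → ['JONES', 'SMITH']
```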

Magic numbers have plagued programming since long before databases came around. Here is a link to The Jargon File, which talks about magic numbers in other contexts.

Conclusion: Many Kinds of Tables, Many Kinds of Keys

This week we have seen that there can be many practical benefits to using different kinds of keys for different kinds of tables. Using the right kind of key for the right kind of table leads to simpler code and better performance, whether you code SQL directly or use an ORM system.

Remember always that your application will follow the same structure as your tables. If the tables are designed well, the code will be lean, tight, efficient, and robust. Because table design is so important, it is best to know well the different kinds of tables there are: reference, master, cross-reference, and transaction, and to build their keys wisely.

Next week we will zoom into cross references and find out how important they are in good table design and how they will improve your applications.

Next Essay: Cross Reference Validation Pattern

Monday, January 7, 2008

Database Skills: Third Normal Form and Calculated Values

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Update: There is a new essay on Business Logic that also goes into how calculated values affect your code paradigm.

Third Normal Form Defined

Third normal form is pretty simple. In plain terms it means that no column can depend on a non-key column. In technical terms we say that there are no "transitive dependencies."

The example below shows a table that is not in third normal form. The problem is that the letter grade depends on the numeric grade:

SUBJECT  | YEAR | TEACHER | STUDENT    | GRADE | LETTER
---------+------+---------+------------+-------+-------
HIST-101 | 2008 | RUSSELL | NIRGALAI   |  80   |   B
HIST-101 | 2008 | RUSSELL | JBOONE     |  90   |   A
HIST-101 | 2008 | RUSSELL | PCLAYBORNE |  95   |   A

To repeat: the key is SUBJECT + YEAR + TEACHER + STUDENT; there are no duplicate values of this combination of columns. In plain terms, no student can be listed in the same class twice, which only makes sense after all. We note next that there is only one actual property in the table, the GRADE column: the student's final grade in a particular class.

The violation is the last column, LETTER, which is not a property of the key but is functionally dependent upon the GRADE value. The term "functionally dependent" means that column LETTER is a function of GRADE.

As always, normalization is meant to help us. The problem with the table as-is is that you cannot guarantee that the LETTER column will be correctly updated whenever the GRADE column is updated. The non-normalized table can lead to what we call UPDATE ANOMALIES.
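A quick demonstration of the anomaly, using SQLite and the rows from the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE grades (
        subject TEXT, year INTEGER, teacher TEXT, student TEXT,
        grade INTEGER, letter TEXT,
        PRIMARY KEY (subject, year, teacher, student)
    )
""")
conn.execute(
    "INSERT INTO grades VALUES ('HIST-101', 2008, 'RUSSELL', 'NIRGALAI', 80, 'B')"
)

# The anomaly: nothing stops an update to GRADE that leaves LETTER stale.
conn.execute("UPDATE grades SET grade = 95 WHERE student = 'NIRGALAI'")

row = conn.execute("SELECT grade, letter FROM grades").fetchone()
print(row)  # → (95, 'B')  -- the letter no longer matches the grade
```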

Problem: Calculated Values Are a Fact of Life

We have seen in prior essays that enforcing first normal form and second normal form leads to more tables and simpler code. We saw in both cases that a programmer's natural tendency to use a "simple" solution of fewer non-normalized tables leads to more complicated and error-prone code. But when we come to third normal form the benefits do not seem so clear-cut.

Third normal form tells us to leave out the LETTER column, but the plain fact is that the LETTER column is very useful! The situation will only get worse if there are a few dozen (or a few hundred!) tables. If we leave out all calculated values then we must provide for them outside of the database, but if we put them into the tables we risk all kinds of errors in live data.

This is where things get sticky. In this week's essay I am going to explain the classic solutions to this problem, and include my own as the last one. It is my personal opinion that the problem of calculated values is at the heart of many of the flame wars, arguments and battles that are fought in the arena of database application development. You can't live with 'em but you can't live without 'em. Some kind of logical approach is required that gives us the benefits of normalization and the benefits of calculated values.

Preserving 3NF With Another Table

In many cases something that looks like a calculation can actually be removed out to another table and turned into a foreign key. As a matter of fact, the example I gave above can be moved out to another table:

GRADE  | LETTER
-------+-------
 100   |  A
  99   |  A
  98   |  A
  97   |  A
  96   |  A
  95   |  A
  94   |  A
  93   |  A
  92   |  A
  91   |  A
  90   |  A
  89   |  B
  88   |  B
  87   |  B
  86   |  B
  .... etc ....
  1    |  F
  0    |  F

Most programmers would not think of a solution like this because it has too many rows! Your basic programmer finds it ugly, so he will not use it. This underscores a recurring theme of this series: skills that lead to good coding will lead to bad table design. Table design should be done according to the principles of table design, not the principles of coding. The customer will never know you have an "ugly" table; they will just know your system works well.

There are certain benefits that you always get when you normalize and when you put everything into tables. The big advantage is that you can change your system without modifying a program file. In the United States we would leave my "ugly" table as it is for public schools, but for many private schools we would change it to 93-100 for an "A", 85-92 for a "B", and so forth. We could make 70-100 "P" for pass and 0-69 "F" for fail. We can do all of this without changing any program code, which is always a good thing.
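A sketch of the lookup-table approach in SQLite (here the rows are generated by a loop rather than typed in by hand, using the public-school boundaries from the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per possible numeric grade; the letter is data, not code.
conn.execute("CREATE TABLE letter_grades (grade INTEGER PRIMARY KEY, letter TEXT)")
conn.executemany(
    "INSERT INTO letter_grades VALUES (?, ?)",
    [(g, "A" if g >= 90 else "B" if g >= 80 else
          "C" if g >= 70 else "D" if g >= 60 else "F")
     for g in range(0, 101)],
)

conn.execute("""
    CREATE TABLE grades (
        student TEXT PRIMARY KEY,
        grade   INTEGER REFERENCES letter_grades (grade)
    )
""")
conn.execute("INSERT INTO grades VALUES ('JBOONE', 90)")

# The letter is never stored with the student; it is joined in when needed.
row = conn.execute("""
    SELECT g.student, g.grade, lg.letter
      FROM grades g JOIN letter_grades lg ON lg.grade = g.grade
""").fetchone()
print(row)  # → ('JBOONE', 90, 'A')
```

Changing a school's grading scale then means updating rows in letter_grades, with no program code touched at all.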

Preserving 3NF With a View

A "database view" is a SELECT statement that is given a name and stored permanently in the database. The nifty thing about views is that they can be queried with a SELECT statement just like a table. A view is a great tool for preserving normalization because you can keep the table itself fully normalized and put the calculated values in the view. In most databases a view is created with syntax like this:

CREATE VIEW grades_extended AS
SELECT subject, year, teacher, student, grade
      ,CASE WHEN grade >= 90 THEN 'A'
            WHEN grade >= 80 THEN 'B'
            WHEN grade >= 70 THEN 'C'
            WHEN grade >= 60 THEN 'D' 
            ELSE 'F' END as LETTER
 FROM grades;
-- this command pulls only from the normalized table
SELECT * FROM grades

-- this command gives letter grades also
SELECT * FROM grades_extended

Because a view is simply a SELECT statement, a view can JOIN information from multiple tables, and it can also do aggregate functions like SUM, AVG and so forth, so views are very powerful indeed.

Views are very popular and can really help out in simple situations. I have found that they become very difficult to manage as the table count increases, and so I personally do not use them for this purpose anymore, but I would not discourage anybody who wants to try them out. They will probably benefit you a lot.
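For the curious, here is the idea as a runnable SQLite sketch; SQLite's CREATE VIEW syntax happens to match the example above closely:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (student TEXT PRIMARY KEY, grade INTEGER)")
conn.execute("INSERT INTO grades VALUES ('PCLAYBORNE', 95)")

# The base table stays normalized; the letter exists only in the view.
conn.execute("""
    CREATE VIEW grades_extended AS
    SELECT student, grade,
           CASE WHEN grade >= 90 THEN 'A'
                WHEN grade >= 80 THEN 'B'
                WHEN grade >= 70 THEN 'C'
                WHEN grade >= 60 THEN 'D'
                ELSE 'F' END AS letter
      FROM grades
""")
print(conn.execute("SELECT letter FROM grades_extended").fetchone()[0])  # → A
```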

Preserving 3NF by Calculating As Needed

Another approach is to keep the calculated columns out of the database and calculate them as needed.

Some years ago I was made Lead Systems Architect on a package that already had about 750 tables when I took the position. The tables were in strict third normal form: no calculated values, no exceptions.

All calculations were made as needed, and they were all performed in program code. This particular program was an ERP package, so there were a great many monetary calculations. Absolutely nothing about it was simple: every single calculation had many possible formulas that took into account discounts, locations, customer types, time of year, and, as far as I could tell, the sunspot cycle and the phase of the moon.

The sales team was very successful, and so this program was under heavy modification at all times by programmers on two continents. The end result was this:

  • Many calculations involved calls to subroutines in 2, 3 or 10 different programs, making them nearly impossible to debug.
  • The program was completely impossible to document; we didn't even try.
  • The program was impossible to test, because of course nobody really knew what it was supposed to be doing.
  • Performance was often terrible. The need to calculate every value on every access put a huge strain on the database as information was pulled over and over and cycles were spent doing calculations.
  • Only a few people "knew where the bodies were buried", which contradicts standard advice for keeping staff interchangeable.
  • Sometimes a customer would print an invoice, then print it a few months later after an upgrade and get different numbers, which we could not explain or correct except at great time and expense.

When somebody tells me to keep 3rd normal form religiously and keep all calculations in program code I say "been there, seen that, no thanks."

Discarding 3NF by Using Program Code

The most popular approach by far is to calculate derived values and store them using program code.

Most programmers use some kind of object-oriented language, and are used to the idea that they will have a GRADES class that handles the calculations for the GRADES table. Whenever anybody inserts or updates the table, some method looks at the GRADE column (a value from 0-100) and re-calculates the LETTER code, then saves the whole thing in the table. This violates 3NF but makes the database more fully populated and therefore more useful.

The major advantage to this approach is that most programmers understand how to do it with little prompting or explanation. Moreover, it does make the database indeed much more useful.
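A sketch of this pattern (the Grades class and its save method are hypothetical, not from any particular ORM):

```python
import sqlite3

def letter_for(grade):
    # The same CASE logic, but living in application code.
    if grade >= 90: return "A"
    if grade >= 80: return "B"
    if grade >= 70: return "C"
    if grade >= 60: return "D"
    return "F"

class Grades:
    """Table-handler class: recalculates LETTER on every save."""
    def __init__(self, conn):
        self.conn = conn

    def save(self, student, grade):
        self.conn.execute(
            "INSERT OR REPLACE INTO grades (student, grade, letter) VALUES (?, ?, ?)",
            (student, grade, letter_for(grade)),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (student TEXT PRIMARY KEY, grade INTEGER, letter TEXT)")
Grades(conn).save("NIRGALAI", 80)
print(conn.execute("SELECT letter FROM grades").fetchone()[0])  # → B
```

Note that nothing in the database itself prevents a raw UPDATE from bypassing the class and leaving LETTER wrong, which is exactly the weakness discussed next.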

The largest problem with this approach is that it is subvertible. Unless the programmer is absolutely certain that nobody can access the database except through her code, the possibility always remains that a different program will mess up the values. In fact, you must be able to guarantee that even within your program no rogue programmer can accidentally (or maliciously) issue commands that subvert the values. If you get even one prima donna on your team who does not believe in following rules, serious damage can be done.

Some programmers say, "that's ok, my team is small and tight and only my program will access the data," but this just means they never plan to sell a program to a company that has their own IT department. That company will demand access to the database, possibly even write access, and then you have a problem.

However, when all is said and done, this is a perfectly valid way to get working programs written. When it is reasonable to expect that nobody else can access the database, and you trust your team, then this method may be the Right Thing.

Discarding 3NF With Triggers

The last approach I will explain is the one that I use myself. In this approach you put a "trigger" on the table that performs the calculation for you. A trigger is a small program that resides in the database server itself and is written in some kind of SQL-like extended language (or Perl or C or anything else supported by your server). The trigger executes whenever certain events occur in the database. Specifically, every trigger fires for exactly one table, can fire before or after an event, and can fire for insert, update, or delete. Triggers are ideal for performing calculations.

Trigger code tends to be highly specific to the platform you are using. The code below works for PostgreSQL and will definitely not work for MS SQL Server or MySQL; you will have to look up the syntax for those systems and do a little translation. But here it is:

    CREATE OR REPLACE FUNCTION grades_ins_before_f() RETURNS trigger AS $$
    BEGIN
        new.letter = CASE WHEN new.grade >= 90 THEN 'A'
                          WHEN new.grade >= 80 THEN 'B'
                          WHEN new.grade >= 70 THEN 'C'
                          WHEN new.grade >= 60 THEN 'D'
                          ELSE 'F' END;
        RETURN new;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER grades_ins_before
        BEFORE INSERT OR UPDATE ON grades
        FOR EACH ROW EXECUTE PROCEDURE grades_ins_before_f();

I should note that PostgreSQL's triggers are very powerful, but their two-stage command to make a trigger is very annoying. Other servers such as MS SQL Server, DB/2 and MySQL have simpler syntax. I put up with PostgreSQL because it is so powerful and it is free software.

One very big advantage of this system is that the calculated value is non-subvertible, no rogue commands can corrupt the calculated values. I personally love this because it gives me all of the advantages of normalized data while also getting useful derived data in the tables for easy access.
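As a sketch of the same idea on another server: SQLite cannot assign to NEW.letter in a BEFORE trigger, so the common idiom there is an AFTER trigger that updates the row just written. The example also demonstrates the non-subvertible property, since a rogue insert that supplies the wrong letter is corrected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (student TEXT PRIMARY KEY, grade INTEGER, letter TEXT)")

# SQLite idiom: an AFTER INSERT trigger that fixes up the new row.
conn.executescript("""
    CREATE TRIGGER grades_letter AFTER INSERT ON grades
    BEGIN
        UPDATE grades
           SET letter = CASE WHEN NEW.grade >= 90 THEN 'A'
                             WHEN NEW.grade >= 80 THEN 'B'
                             WHEN NEW.grade >= 70 THEN 'C'
                             WHEN NEW.grade >= 60 THEN 'D'
                             ELSE 'F' END
         WHERE student = NEW.student;
    END;
""")

# Even a "rogue" insert that supplies the wrong letter is corrected.
conn.execute("INSERT INTO grades VALUES ('JBOONE', 90, 'F')")
print(conn.execute("SELECT letter FROM grades").fetchone()[0])  # → A
```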

Conclusion: Denormalization and Sorting It Out

Third normal form itself is easy enough to understand, but it brings us our first conflict between good database principles and the real world. The principles of normalization keep out calculated values, but calculated values are part of every useful application.

This is why so many people say that it is necessary to denormalize. But keep in mind always that there is a difference between non-normalized and de-normalized. We have no interest in a non-normalized database, where the designer has not even bothered to identify the correct tables, primary keys, and foreign keys.

What we have seen instead is that it is very useful first to normalize a database by identifying the individual things we want to keep track of, making an individual table for each specific kind of thing, and then identifying the primary properties of each of these things. The process of de-normalization begins when we define calculated values and begin to put those into the tables. Once we do this, we must choose a method to ensure that these values are correct, using views, triggers, program code or a combination of the three.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.