Sunday, March 30, 2008

How the SQL UNION Affects Table Design

Welcome to the Database Programmer!

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Introducing the UNION

Last week we saw a JOIN. A JOIN will combine values from two (or more) different tables and put them into the same row of the output.

In contrast to the JOIN, which puts values into the same row, the output of a UNION is to add together rows from more than one query to make a larger result. Here is an example of a UNION of a customers table and a vendors table:

The image shows that rows from two tables are combined one after the other in the results. The SQL is here:

SELECT customer as code,'CUST' as type,name,state
  FROM customers
UNION ALL
SELECT customer as code,'VEND' as type,name,state
  FROM vendors

You Usually Want UNION ALL

The example above uses the syntax "UNION ALL", with the keyword "ALL" being added. You will see this "ALL" keyword on every example on this blog and in many examples elsewhere.

If the "ALL" keyword is not included, most database servers will examine every row of the results and attempt to eliminate duplicates. They do this by examining every single column of every row. This means if you pull a 10-column query of 3000 rows out of one table and 6000 rows out of another, the database will attempt to de-duplicate 9000 rows based on all ten columns. This has two disadvantages, which are:

  1. It is slow, like a snail crawling through glue, and
  2. Most people don't actually expect or want the query to be de-duplicated.

Including the "ALL" keyword tells the server to return all rows and not bother to try to de-duplicate them.

Numbering Columns

When you do a GROUP BY (which we will see in later weeks) or an ORDER BY, most database servers require you to list the columns by number, not by name. This is because the values in the result may be coming from columns that had different names in the original base tables.

Object Oriented Influences

The example above shows only three columns each for the customers and vendors tables. We can assume of course that they have more columns, and we can also guess that many of those columns will be the same for both tables. In any case where you use a UNION, you may find yourself asking why you have two tables, and if you should have only one.

To make matters worse, if you learned your table design in the school of Object Orientation then you will have a very strong desire to make a base class called "trading partner" and make customers and vendors into subclasses of that base class. Then you want your tables to reflect your classes so there you are with one table.

Nevertheless, this is usually a mistake. It will make your code more complicated and error prone. To understand why, let's look at UNION again.

The UNION clause allows you to combine information from separate similar sources only when needed. In other words, the UNION clause lets you combine information upon demand, without requiring a permanent combination.

If you combine vendors and customers you have a problem that checks can be issued to customers, or orders can be entered against vendor accounts. To prevent this, you need complicated business logic in your application, and additional columns in the tables. Moreover, a customer will have columns a vendor does not and vice-versa, so you need more logic to ignore some columns or hardcode them or otherwise handle them based on what operation is begin performed. If they are separate, simple foreign keys do the trick and you don't need that extra code. Once you know about the UNION clause, there is little incentive to combine entities that should not be combined.

Conclusion: Combine at Output

If you go back and review the essays on table design patterns you will see that good table design is all about separating facts out into many different tables. The goal of the separation is to store each fact in exactly one place. Using primary keys and foreign keys on a well normalized database ensures that data is correct while it is on the way in.

However, on the way out, you need to recombine those facts. Two weeks ago we saw that the JOIN combines facts from different tables into a row. This week we saw that we can use the UNION to combine results vertically, that is, to add the results of one query to the results of another. Judicial use of UNION makes your application lean and efficient by letting you normalize data to ensure correctness on the way in, while still combining facts where necessary on the way out.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are:

Sunday, March 23, 2008

Of Tables and Constraints

Hello and welcome to the Database programmer blog. This is a blog for anybody who wants to learn about databases on their own terms and find out how database decisions affect their applications.

Constraints are probably the least talked about topic in application and database design. I suspect that there is something deep in the souls of programmers that just does not like to talk about limitations. We like to talk about flexibility, extensions, abstraction and so forth. Constraints rub us wrong. But of course we also love to write absolute rules for ourselves and others (thou shalt always use integer primary keys, thou shalt not use HTML TABLE elements, and so on). So go figure.

If you want to make a deadline, you have to limit what you do. You cannot make a list a kilometer long and finish it in a day. By the same principle, if a database is going to accurately store data, there must be rules about what makes the data correct. These rules we call constraints.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Table Structure is The First Constraint

Table structure is not usually listed as a constraint, but your table structures represent the first and most basic constraints on your application.

The table structure can be pictured as the mission statement of the application. If you have a table of students, we can safely assume you must record certain facts about students. Likewise, if you do not have a table of Nobel Prize Winners, we can safely assume that Novel Laureates are outside the scope of your application.

This basic fact, that table structure represents a codification of design decisions, is worth getting into a little deeper.

Where Do You Want The Zebra?

The importance of table structures can be seen with an analogy. Imagine you purchase a building and some printing equipment, and then you hire some people and create a printing shop. Somebody comes in and says, "Where do you want this zebra?" The answer is that you have no need or use for a zebra in a printing shop, and it will actually get in the way and cause problems. So the answer is "Not here!" There is a constraint on the use of your building, which is that wild game is not welcome.

Database tables represent that decision to commit to a certain task or collection of tasks. They constrain the end users to perform the mission of the company, and not some other mission.

Sometimes the more naive programmers will not give up on this point, and they will respond with something like "Well, yes, but computers do not have the limitations of physical systems, they are so much more flexible." This is true but wrong. A computer language like C++ is incredibly flexible in its potential, but that potential is worth nothing until somebody sits down and writes code to do a specific task, at which point that code is good only for that task. In other words, the general purpose value of the computer is only realized when dedicated to a specific purpose. Likewise, a database server is in principle capable of storing data about anything, but is only useful once decisions have been made to create specific tables to store specific things.

To say it one more time, table designs are constraints because they say what will be tracked and what will not be tracked.

Primary Keys, Unique Constraints and Foreign Keys

The next category of constraints have been well discussed on this blog, and I do not want to spend too much time on them here, but we will do a quick review.

Your table designs are the foundation of your application. In simple terms, a well normalized database will have a table for every different kind of thing that must be tracked (students, courses, classrooms, etc). In addition, there are always plenty of cross references that list relationships between these primary entries (teacher to courses they are qualified to teach), and transactions that represent interactions between these (which courses a student has completed).

The primary key makes sure that there are no duplicates, and the foreign key ensures that data stored in separate tables makes sense. For instance, imagine a table that lists which courses each teacher is qualifed to teach. It would not make much sense to list teachers in this table that are not on the faculty, and the foreign key prevents these kinds of mistakes.

As a veteran database designer I am still often astonished at how many seemingly complicated and unique business requirements reduce to a simple collection of tables, with their primary and foreign keys. I have seen this happen so many times that I am now automatically skeptical when somebody tells me they have 'special needs' that probably will not fit into simple tables. They always end up fitting into tables.

Check Constraints

Once you have your tables and their keys, there is one more kind of constraint you can use, which is called the CHECK constraint. A check constraint is nothing more than some SQL expression that must always be true, and is attached either to a column or a table.

MySQL does not support CHECK constraints as of version 5.0. This is one of the reasons why users of other database systems often say nasty things about MySQL.

A typical example for a check constraint is that a discount price must be lower than a regular price. Another use is to make sure that a number is between one and 100, so that it can be used as a percentage. This bit of SQL illustrates these ideas in the PostgreSQL dialect:

CREATE SEQUENCE orderlines_line
CREATE TABLE ORDERLINES (
   order int references orders (order)
   ,line int DEFAULT nextval('orderlines_line')
   ,sku char(10) references items (sku)
   ,price numeric(10,2)
   ,discount_price numeric(10,2) CHECK (discount_price < price)
   ,tax_pct numeric(3) CHECK (tax_pct >= 0 AND tax_pct <= 100)
)

Like all server-side technologies, the CHECK constraint can greatly reduce the amount of client-side code you have to write. Going further, the CHECK constraint makes your application easier to port to other languages, if for no other reason than there is less code to port.

More Complex Constraints

Sometimes constraints are more complex than can be expressed with a CHECK constraint. In these cases you can attach a "trigger" to a table that fires when a row is written. The trigger can modify the data being written or prevent the write from happening. Triggers are a large topic so they will be treated separately in other essays. Like all other computer technologies, triggers have plenty of fans who quietly use them to great affect, while there are plenty of noisy types telling you your nose will fall off if you use them.

Conclusion

This week we have seen that constraints are used to help to ensure that all data going into a database is correct. Constraints begin with the basic decisions about tables themselves, and then go through the selection of keys, and can be more complex CHECK constraints or even be completely general purpose code-based rules enforced in Triggers.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Sunday, March 16, 2008

The JOIN is the Cornerstone of Powerful Queries

Welcome to the Database Programmer!

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The Very Basics of JOIN

Sometimes you need information that has been separated out into different tables. For instance, imagine you are a programmer on an in-house eCommerce site for a company that sells computer parts around the world. The Sales Manager walks down the hall one day and says she wants a detail listing of customer types and order dates. Our customer types are in the CUSTOMERS table, and the order dates are in the ORDERS table, so what she wants is this:

A SELECT statement to pull these columns would look like this:

SELECT customers.customer,customers.custtype,orders.date
  FROM customers 
  JOIN orders     ON customer.customer = order.customer

Refining With Aliases

Because the JOIN is so basic and common, it can get very cumbersome to constantly spell out long table names in front of every column. Therefore you can use an "alias" in the JOIN clause to give each table a nickname:

SELECT c.customer,c.custtype,o.date
  FROM customers c
  JOIN orders    o ON c.customer = o.customer

More Than One JOIN

Sometimes the information you need is in two tables that cannot be JOINed because they are not "next" to each other. Put another way, neither table has a foreign key to the other. Put a third way, they do not have any columns in column. Let us pretend our troublesome Sales Manager comes down the hall again and this time she wants a listing of every item ordered by every customer type. This time she is looking for the following:

This time we need to start with the CUSTOMERS table and then "go through" the ORDERS table to get to the information we need in the LINES table. This means two JOINs:

SELECT c.custtype,l.sku
  FROM customers c
  JOIN orders    o ON c.customer = o.customer
  JOIN lines     l ON o.order    = l.order

Most Requests Will Make Sense

Sometimes a user's request will appear to make no sense. The user asks for a combination of values that appear to have no connection, and we programmers object, "that makes no sense! Why would you want that?"

The most important idea to keep in mind here is not a technical idea at all, it is more a matter of how to keep people happy. In my experience it is extremely rare for a customer to ask for a query that well and truly makes no sense. Just because I do not understand it does not mean it makes no sense! In fact the Sales Manager likely knows her job very well and if she is asking for items by customer type she must have a reason.

Now, with that being said, the technical solution in these cases is to follow the foreign keys. If a database has been designed well, all tables will be connected to each other through foreign keys, and you can trace out a path that connects the various data points by following these foreign keys.

Denormalizing For Performance

You have probably heard people say that sometimes you need to "denormalize for performance." Now we will look at what that means.

Consider an assignment given to two people, one of them a veteran database programmer and the other a newbie. It is guaranteed that the veteran's database design will have a lot more tables in it than the newbie's database. This is because the veteran knows he will have far fewer errors getting data in if he keeps a separate table for each level of detail required by the program. By contrast, the newbie will be guided by a strange desire to save on tables as if there is some kind of world-wide shortage of tables.

But the veteran now has a problem. While normalization is great for ensuring correctness on the way in, it tends to require more JOINs on the way out, and it so happens that JOINs are rather expensive for a database to perform. In fact, they are one of the most expensive operations there is, and they only get worse as the number of tables being JOINed increases.

Therefore, the veteran will sometimes take a design to the fullest of normalization, and then deliberately denormalize it to reduce JOINs. A very simple example is adding the CUSTTYPE column to the ORDERS table and then copying the value of Customer Type onto each order. If the programmer is confident that the value will always be copied correctly, then any report on sales that involves customer types can avoid an expensive JOIN between ORDERS and CUSTOMERS. This is the essence of the "Denormalizing for Performance" approach, and we will see more essays specifically on that topic later in this series.

Denormalized is not the same as non-normalized. The newbie will have fewer tables, tables that are non-normalized because they have values bunched together that do not belong together. The newbie will spend a lot of time correcting data errors as a result of this. The veteran however will have lots of normalized tables and will look for (or write) a framework that assists in controlling where a user can write values and when the values are copied around to the their de-normalized spots.

Conclusion: Do Not Fear the JOIN

Just as the foreign key is the fundamental (and in fact the only) mechanism that relates data together, so the JOIN is the basic building block that ties that information back together.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are:

Sunday, March 9, 2008

Introduction To Queries

This is the Database Programmer blog, for anybody who wants practical advice on database use.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Today we are going to look at performance, direct SQL coding, and then begin with the basics of the SQL SELECT command.

Disk Activity Determines Performance

Before we even look at the SQL SELECT command, we must know what motivates the experienced database programmer to pick certain kinds of queries and avoid others. The first motivator is of course performance. We all want our programs to go fast. If you want a fast database program, then you have to think about the hard disk.

The slowest device in a computer is the disk drive. The price of disk reads is so much higher than in-memory operations that there is nothing to gain by optimizing code unless disk reads are optimized first. I often tell my programmers to consider in-memory operations to be "free" when coding, and to concentrate all optimization efforts on reducing disk reads. This approach may gloss over some important truths, but it is an excellent starting point for database beginners.

Optimizing disk reads comes down to writing efficient queries on top of well-designed tables. With that being said, we have one more short note to cover before going into the syntax of the queries.

A Quick Note on Inline SQL

Many authors and framework programmers discourage the use of SQL SELECT statements directly in application code. This essay contains no opinions on that question.

However, it is important to know that the SELECT statement must be coded somehow and sent to the server, whether you code it manually or some framework tool generates it for you. This essay is all about how to code (or generate) that SELECT statement. Knowing how it all works is a requirement if you code your own framework or if you suspect your chosen framework may not be your friend in all cases.

Introducing SQL Select

The simplest query is four words (or symbols) long. If your database has a table of countries then here is a very simple query that will work:

SELECT * FROM COUNTRIES

This query will return all rows and columns from a table. Depending on what language you code in, these may come back in an associative array, an array of generic objects (or similar), or some other special-purpose object like a ResultSet.

You have probably been told not to use the "*" in a SELECT, for performance reasons. This is usually good advice. In the simplest case, you save bandwidth by only retrieving the columns that are of interest to you. If your table contains long varchar or text (aka clob) columns there are even more reasons to avoid "SELECT *". When you have long varchar and text columns, they may be stored outside of the main storage for the table, causing the server to look in two places to retrieve each row. Therefore, avoiding "SELECT *" and always specifying just the columns you need reduces disk reads on the server and reduces bandwidth delivering the results.

But as this is an introductory essay it is important to know how to retrieve a complete table, so I have used the "SELECT *" here.

Filtering Results with WHERE

The WHERE clause limits which rows from the base table will go into the query results. You specify a WHERE clause as one or more boolean conditional expressions. Multiple expressions can be separated by AND and OR, using parentheses to group expressions. You can review your product's documentation to see all of the comparisons and functions that are available. A moderate WHERE clause might look like this:

SELECT country,name
  FROM COUNTRIES
 WHERE country like 'A%'
   AND (   name like 'D%'
        OR name like 'E%'
       )
   AND continent = 'Africa'

Filtering and Performance

The primary purpose of a WHERE clause is to obtain the correct result. However, it is also a very important performance tool. Here is why.

If you are completely and totally new to database programming, you may get the idea that you will skip the WHERE clause and do your filtering in the application. This may seem like a good idea because you save the trouble of learning two languages. Instead of learning SQL plus your application language, you can concentrate on just your application language. And so you make a reasonable decision to use as little SQL as possible and just do everything in application code.

The drawback to this perfectly reasonable suggestion is that it violates our first performance concept, it creates a huge disk read burden. If you need five rows out of 50,000, then filtering in the application requires the database server to read all 50,000 rows off the disk. On top of that, these have to be delivered to your application for processing.

Making use of the WHERE clause means that only 5 rows are read off the disk (this assumes the presence of an index which will be explained in a later essay). In this particular example, using a WHERE clause will perform 1000 times faster than not using it. Of course this is only a single very vague example, but since a database application is composed largely of queries, it is definitely a good idea to have all of these queries start out on solid ground.

Foreign Keys and JOIN

Next week we are going to look at JOINs in much detail, but I want to mention them here briefly. The JOIN clause lets you return results from more than one table, and the JOIN determines how the rows from multiple tables will be matched to each other.

For this week I will say only that good queries will almost always use foreign keys as the basis of their JOINs. We have seen in these essays more than once that the foreign key is the fundamental and only way to connect information in separate tables. Naturally, therefore, the foreign key will loom large in our discussion of JOIN, since JOIN controls the combined retrieval of information from separate tables.

Sorting Query Results

You can sort query results by including an ORDER BY clause in the query. Simply name the columns:

SELECT customer,order,date 
  FROM ORDERS
 WHERE date >= '2008-03-01'
 ORDER BY date,customer

Some database servers let you put an ASC or DESC in front of individual columns, while other servers can only apply a DESC or ASC term to the entire sort operation.

Overall application performance comes into play with ORDER BY clauses. It is almost universally true that you can sort faster on the database server than you can in your code. You want to make sure that your manual queries contain ORDER BY clauses, and that your framework is generating them. You do not want to be sorting in application code in most cases.

Order of Terms Matters

You have to put the various clauses into the right order or they will not work. The order is:

SELECT ....
  FROM ....
  JOIN .... ON ....
 WHERE ....
 ORDER BY...

Conclusions

This week we began to examine queries, by looking at the very basics of the SQL SELECT query syntax. Not surprisingly, performance issues came up for every single part of the query, from the column list to the ORDER BY.

For performance, we looked at the basic idea that disk reads determine performance, which we will see more of in later weeks. I also mentioned that table design determines query efficiency, but we have not gotten very deep into that yet.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are:

Sunday, March 2, 2008

The Requirements Are Always Wrong, Or, Iterative Database Development

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There is a new entry every Monday morning, and the complete table of contents is here. This week we are talking about tools again, specifically upgrades.

Requirements and Changes

We all write programs for other people to use, a programmer without his users is like a musician without an audience. So it goes without saying that we need to know what our customers want.

The requirements we are given will never be right. We are destined to get requirements that are incomplete, incorrect, contradictory, and impossible. We cannot control this reality, we can only control our response to it.

Most programming theory is at its base a response to this reality, so that you have the "waterfall" crowd trying to get everything right before you start, the CMM crowd trying to figure out exactly how long it will take, and the "agile" crowd saying you'll never get it right so get started and modify as you go.

On the database side we do not have such a wealth of theory on how to deal with mistakes (and changes). This may be because the trend these days is for theories and frameworks that preach a gleeful ignorance of database principles, or it may be because SQL/Relational databases were already firmly established before the explosion of the web. But whatever the reason, we have a general need for a basic review of how to build your database when you know that the requirements you are given will be full of holes.

The Perfect Requirement

Before we look at best practices, we will spend a few paragraphs looking at why the requirements will never be right. For one particular requirement to go completely right, each of these seven things numbered events below must happen:

  1. The customer herself must understand the requirement.
  2. The customer must consider the requirement worth mentioning.
  3. The customer must state the requirement correctly.
  4. The requirement must get to the programmer without corruption.
  5. The programmer must understand it.
  6. The programmer must design tables correctly.
  7. The programmer must correctly write supporting code.

These are not impossible, but since human beings are not perfect, each step will be wrong some percent of the time, and sometimes more than one will be wrong, and so on.

The last 3 all involve programmer error, and they all point to the need for your development process to have some iterative element, because it is inevitable that somebody will be going back and repeating some work.

Steps 1 and 3 involve customer error. If you are lucky the error will be bad enough to be noticeable, so that you can go back to the customer and ask about it before any tables are built or code is written. In any event, dealing with these problems is much more about diplomacy than technical knowledge. On the plus side, since these things are inevitable, if you build your diplomatic skills you will win a lot of good will from your customers, which is worth far more than technical skills.

Item 4 deserves some special mention. Item 4 is more likely to get screwed up the larger your organization is. Some schools of thought recognize this and preach a super-close connection between customers and programmers, though this is often very difficult to get across to managers who believe that layers of communication provide protection against mistakes.

Item 2 is probably the worst. This gets messed up when some requirement is so much a part of a customer's experience that it would never occur to them to tell you, anymore than they would include "remember to keep breathing" in instructions on how to setup a home theatre system. (See the earlier essay on Limited Transactions for an example).

It Only Gets Worse

The seven items listed above are the simplest possible list of requirements. In a larger company there will be intermediate steps, each of which is usually meant to introduce oversight but which usually introduces more errors. This can only make things worse for the requirements.

Moreover, the guiding philosophy of the technical manager(s) can introduce serious systematic error. For instance, certain fashionable ideas like "Table structure is not that important, just get the code right first" can take hold of a project and cause huge amounts of wasted effort because it just ain't so.

There are lots more problems like prima donna programmers, staff changes, politics, and so on. There is no end to the number of factors that can interfere with the correct articulation, communication, and implementation of requirements.

Iterative Database Development

The experience of decades has led again and again to the conclusion that software development is iterative, you will always go back and change the code. There is no such thing as the finished program (except maybe for PING), and there is rarely any such thing as the bug-free program (again except maybe for PING).

But how exactly do we do iterative develpment with a database? That is what I would like to address now.

We have to begin by looking at the kinds of changes you can make to a database.

  1. Trivial Changes Like widening a column
  2. Additons such as new columns and new tables.
  3. Tangled and Difficult changes like correcting a structure mistake and copying and modifying data into new locations.

Any sensible person is going to try to avoid #3, and by contrast, changes of type #1 are no problem if we have tools that support structure changes. This leaves us with the surprising conclusion that a strategy for iterative database development is to spend your time making additions. When requirements are found to be wrong, incomplete, contradictory, or in any other way messed up, the ideal situation is to add a column or two, add a table or two, and add some code, without ever having to touch the existing code.

We will now see how to do this.

Identify Master Tables and Master Transactions

Every program has a core purpose, and at the center of this purpose will be master entities that have to be tracked. For an online store these will be at the very least customers, items, and a cart (header and lines). For the school management program that I have used in these essays, these are teachers, students, courses, and course enrollments.

When you are strongly confident that you have a few master entities identified, you can start to frame your efforts around them. You can ask for clarification of customer requirements in terms of these master entities by asking, "Is this a property of the customer?" or "Can a student have more than one major?"

The deliverable at this stage is skeleton tables, primary keys and foreign keys. That is, to identify the central entities and their relationship to each other. This effort is not so much about identifying the attributes or properties of these entities, but simply the entities themselves. Once you have identified the entities, you will basically spend the the rest of the project adding columns (and code) as the requirements are worked through. This is the essence of doing iterative development as a series of additions.

The great thing about this approach is that if you miss some major table you rarely have to perform major surgery on the work you have already done. If further discussions reveal a central entity that was missed before, just make a new table for it. The amazing and cool thing about this is that you rarely have to change the existing tables you have already worked up, except perhaps to add a new foreign key.

Translate Features Into Columns and Tables

Many features can be resolved into columns once you have the main tables worked out. For instance, on a school system the specifications may say "every student must choose a major." A seasoned database veteran immediately recognizes this as a foreign key. There is a table of majors (which is definitely related to the table of departments somehow) and a "MAJOR" column in the students table.

In terms of iterative development, we can see that it should be fairly painless to add in a system for majors without disturbing the rest of the system. We add a table of MAJORS (probably with a foreign key to the table of departments), and add a MAJOR column to the STUDENTS table as a foreign key to the MAJORS table.

To repeat the main theme, the last thing we want to do is get stuck massaging data or moving it from one table to another on a live system. By far the much easier route is to simply add tables or columns. If you can learn to think of features in terms of the underlying tables and columns, then you can move towards always making additions, which are far easier.

Identify Inert Columns

I like to use the term "inert" to refer to columns that are not involved in any business rules but must be printed out or displayed from time to time. The easiest example would be a student's mailing address. This has nothing to do with her major, her grades, or her courses, but must be printed occassionally for mailings or on paperwork.

The main point with inert columns is that you can add them in whenever you want without affecting the rest of the system. When the requirements omit these features they are the easiest to put in after the fact.

Go For Maximum Detail

Imagine you are asked to make a simple Business-to-Business system in which customers are shipped their goods and billed later. The requirements say only that "customers often pay more than one invoice with a single check and this should be easy for the user to process."

So you know already that you have a table of customers, and a table of invoices. Now we have to make a table of payments. Here is the killer question: does the table of payments link to invoices or to customers?

If we link payments to customers then we have the easiest system for the user, they just enter the customer and the check and we are done. By contrast, linking the payments to the invoices is much harder because we have to actually create two tables, a table of PAYMENTS and a table of SPLITS, which details how a payment is split out to apply to various payments. Worse, we probably have to add some special user interface screens to make it easy for the user to drag-n-drop or otherwise enter a payment and apply it to their open invoices.

However, the right thing to do is to go for maximum detail, and apply the payments to the invoices. There are two major reasons. First, it is more faithful record keeping. If you do not do it then the customer will sooner or later ask for the details you have not provided. Second, horror of horrors, if you have a live system and have been applying payments to customers, then when you go over to the other system you will have to move data around and, worse, you will have to guess at how to apply already entered data. This is why tables should always be created with maximum detail in terms of primary and foreign keys.

Everything Else

We have now seen a basic outline of how to do database design in such a way that you can do it iteratively by making mostly additions and avoiding painful data re-arrangments. There are more details than will fit into a single blog entry, but hopefully this will give the basic idea.

Next week we will begin to see query building. I had originally expected to do calculated values first, but upon reflection realized that queries should come first, because query needs often determine strategies for putting in calculated values.

Next Essay: Introduction To Queries