Sunday, July 27, 2008

Different Foreign Keys for Different Tables

A foreign key can be used to implement table design patterns that span multiple tables. By choosing how a foreign key handles a DELETE attempt on the parent table, you can structure your table designs to follow two standard patterns.

Welcome to the Database Programmer blog. This series of essays is for anybody who wants to learn about databases on their own terms. There is a complete Table of Contents, as well as a summary of Table Design Patterns. There is a new essay in this spot each Monday morning.

A Simple Example of Two Foreign Keys

Picture a basic shopping cart, with its two basic tables of CART and CART_LINES (or ORDERS and ORDER_LINES if you are more old-fashioned). The table CUSTOMERS is also in there as a parent to CARTS. Our three tables look something like this:

   CUSTOMERS
      |
      |
     /|\
     CART  Cart is child of customers
      |
      |
     /|\
   CART_LINES  Lines is child of Cart

There are two foreign keys here. CART has a foreign key to CUSTOMERS, and CART_LINES has a foreign key to CART, but the two foreign keys should behave very differently.

Table Types and Table Design Patterns

In A Sane Approach To Choosing Primary Keys we saw that table design begins with identifying the basic kinds of tables: Reference and Small Master Tables, Large Master Tables, Transactions, and Cross-References. Just as we picked different kinds of primary keys for the different tables, so will we pick different kinds of foreign keys between these tables.

Deleting a Customer

Imagine you have a customer who has made 10 orders in 2 years. A system administrator, who is allowed to basically do anything, goes into your admin screens, looks up the customer, and clicks [DELETE]. What should happen?

The near-universal answer is that the user should be denied the action. An error should come back that says "That customer has orders, cannot delete." We want it this way because we never want to delete any parent row and "orphan" the child rows. Database programmers know from long experience that if you allow the DELETE, your queries will give incorrect results, or you will work extremely hard with lots of weird LEFT JOINS and UNIONS trying to get them to come back correctly.

This is not an issue of "flexibility", where a more robust system would allow the deletion. This is a basic question of record-keeping. If the customer has orders on file then the customer must be kept on file. Enforcing this rule keeps code clean and simple, and trying to avoid this rule in the name of "flexibility" just makes heaps of work for everybody.

Going further, the administrator in question, who supposedly can do anything, may not violate the rule. An administrator is simply somebody who can do anything that would not produce bad data. Administrators should not be given the ability to violate the basic structure of the data, they simply have full rights to do anything within the structure of the data.

The DELETE RESTRICT Foreign Key

The behavior we want here is called DELETE RESTRICT. On most database servers this is the default behavior for a foreign key. It means that you cannot delete a parent table row if there are matching rows in the child table.

The DELETE RESTRICT pattern is almost universally used when the child table is a transaction table and the parent table is a master table or reference table.

The syntax looks something like this:

-- Most database servers implement DELETE RESTRICT
-- by default, so this syntax:
Create table CART (
    customer integer REFERENCES customers
   ,order    integer.....
)

-- ...is the same as this explicit syntax:
Create table CART (
    customer integer REFERENCES customers
                     ON DELETE RESTRICT 
   ,order    integer.....
)

Deleting An Order and DELETE CASCADE

Now let us say a staff member is on the phone with a customer, enters an order, enters five lines, and then the customers says "forget it" and the user needs to delete the entire order from the CART.

In this case the user wants to go delete the order, and he expects the computer to also delete the lines. This makes perfect sense, why keep the lines if we don't want the order?

It may seem strange that in the case of deleting a customer it makes perfect sense to stop the user, but when deleting an order it makes perfect sense to delete the lines as well.

The difference is that an entry in the CART table is a transaction entry. When a user deletes a transaction they almost always want to automatically delete all of the relevant rows from all child tables as well. The two rules basically are:

  • The user cannot delete a master entry that has transactions.
  • Deleting a transaction means deleting the entire transaction.

NOTE: By "transaction" here I mean financial transaction or other interaction between master elements. I do not mean a database transaction.

The syntax for DELETE CASCADE looks something like this:

-- if the user deletes a row from CART,
-- do them the favor of deleting all of the
-- lines as well
Create table CART_LINES (
    order   integer REFERENCES CART
                    ON DELETE CASCADE
   ,order_line integer....
)

Conclusion: Different Tables Types, Different Foreign Key Types

I have said many times in these essays that the foreign key is the only meaningful way to connect data in different tables. This week we have seen that the kind of foreign key you choose depends on what kind of tables you are connecting together. Children of master tables generally get DELETE RESTRICT, and children of transaction tables generally get DELETE CASCADE.

Next Essay: Javascript as a Foreign Language

Sunday, July 20, 2008

History Tables

A history table allows you to use one table to track changes in another table. While the basic idea is simple, a naive implementation will lead to bloat and will be difficult to query. A more sophisticated approach allows easier queries and can produce not just information about single rows, but can also support aggregrate company-wide queries.

This week in the Database Programmer Blog we return to table design patterns with an essay on history tables. The basic premise of this blog is that good coding skills do not lead magically to good database skills -- you can only make optimal use of a database by understanding it on its own terms. There is a new essay each Monday, and there is a Complete Table of Contents and a List of Table Design Patterns.

What to Put Into A History Table

Naive approaches to history tables usually involve making a complete copy of the original (or new) row when something changes in the source table. This turns out to be of little use, for reasons I will explain below. A much more useful approach is to track only a few columns and to store any combination of old values, new values, and differences. A history table designed this way can be tremendously useful.

We will start with the example of a sales order table, called ORDERS. The columns we are interested in might look like this:

ORDER | CUSTOMER | DATE     | LINES  |   TAX |  TOTAL |   PAID | BALANCE
------+----------+----------+--------+-------+--------+--------+---------
1234  | 9876     | 5/1/08   |  48.00 |  5.00 |  53.00 |      0 |   53.00
2345  | 9876     | 5/3/08   | 150.00 |     0 | 150.00 | 150.00 |       0
3456  | 5544     | 6/8/08   |  25.00 |  2.60 |  27.60 |  15.00 |   12.60
4567  | 3377     | 7/3/08   | 125.00 |  7.00 | 132.00 |  50.00 |   82.00

We first have to ask which columns must be copied into history so that we can link the history table back to the ORDERS table. The only column we need for tracking is ORDER (the order number), so the history table will always have an ORDER column.

We should also assume that the history table will contain at least a timestamp and a column to track the user who made the change, which brings us to a minimum of three columns.

Finally, it tends to be very useful to track what action caused the history entry, be it an INSERT, UPDATE, or DELETE. This brings us up to four minimum columns.

Next we ask which columns we will definitely not need. There are two groups of columns we will not need, which are 1) the columns that never change and 2) the columns we do not care about. Columns that do not change are likely to be the CUSTOMER and the DATE column. There is no need to bloat the history table with these valus because we can just get them out of the ORDERS table. The second group, columns we do not care about, are are usually things like ship-to address, maybe an email, and other information. Naturally there is no hard-and-fast rule here, it depends entirely upon the needs of the application.

So now we know what we definitely need and what we definitely do not need, and we are ready to begin work considering the columns that will change. Not surprisingly, these are usually all about the numbers. Next we will see how to track the numbers.

Tracking Changes to Numbers

While it is certainly useful to store one or both of the old and new values for a number, it far more useful to store the change in the value, or the delta. Having this number in the history table makes for some really nifty abilities. If you store all three of the old, new, and delta, then you can more or less find out anything about the ORDER's history with very simple queries.

So we are now ready to consider what the history table might look like. We will take the case of an order that was entered by user 'sax', updated twice by two other users, and in the end it was deleted by user 'anne'. Our first stab at the history table might look like this:

ORDER | USER_ID  | ACTION | DATE    | LINES_OLD | LINES_NEW | LINES_DELTA 
------+----------+--------+---------+-----------+-----------+-------------
1234  | sax      | UPD    | 5/1/08  |      0.00 |     48.00 |       48.00
1234  | arkady   | UPD    | 5/7/08  |     48.00 |     58.00 |       10.00
1234  | ralph    | UPD    | 6/1/08  |     58.00 |     25.00 |      -33.00
1234  | anne     | DEL    | 6/4/08  |     25.00 |      0.00 |      -25.00

I should note that if you keep LINES_OLD and LINES_NEW, then strictly speaking you do not need the LINES_DELTA columns. Whether or not you put it in depends on your approach to table design. If you framework allows you to guarantee that it will be correct, then your queries will be that much simpler with the LINES_DELTA column present.

You may wonder why there is no entry for the original INSERT. This is because you must enter an order before you can enter the lines, so the original value will always be zero. Only when lines start going in does the ORDER get any numbers. This is true for header tables, but it would not be true for detail tables like ORDER_LINES_HISTORY.

Some of the Obvious Queries

There are few obvious queries that we can pull from the history table right away. These include the following:

-- Find the value of of the line items of an
-- order as of June 1st
SELECT LINES_NEW 
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234
   AND DATE <= '2008-06-01'
 ORDER BY DATE DESC LIMIT 1;
 
-- Find the original value of the line items,
-- and the user who entered it.  
SELECT LINES_NEW, USER_ID
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234
 ORDER BY date LIMIT 1;
   
-- Find the users who have worked on an order
SELECT DISTINCT USER_ID
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234;

Most of queries should be pretty obvious, and there are plenty more that will suggest themselves once you start working with the history tables.

Queries Involving the Delta

The real power of the DELTA column comes into play when you are trying to compute back-dated values such as the company's total open balance on June 1, 2008. If you have a naive history table that stores only the old value or only the new value, this is truly a tortuous query to write, but if you have both then it is really quite easy.

-- Query to calculate the total open balance of all
-- orders as of a given date
SELECT SUM(BALANCE_DELTA) 
  FROM ORDERS_HISTORY
 WHERE DATE <= '2008-06-01';

This magical little query works because paid orders will "wash out" of the total. Consider an order that is entered on May 20 for $200.00, and is then paid on May 23rd. It will have +200 entry in the BALANCE_DELTA column, and then it will have a -200.00 entry 3 days later. It will contribute the grand sum of zero to the total.

But an order entered on May 25th that has not been paid by June 1st will have only a +200 entry in the BALANCE_DELTA column, so it will contribute the correct amount of $200.00 to the balance as of June 1st.

If the company owner wants a report of his total open balances on each of the past 30 days, you can retrieve two queries and build his report on the client:

-- Get begin balance at the beginning of the period
SELECT SUM(BALANCE_DELTA) as BEGIN_BALANCE
  FROM ORDERS_HISTORY
 WHERE DATE < '2008-06-01';

-- Get the total changes for each day.  When you
-- build the report on the client, add each day's
-- change amount to the prior day's balance
SELECT SUM(BALANCE_DELTA) AS BALANCE_DELTA
  FROM ORDERS_HISTORY
 WHERE DATE BETWEEN '2008-06-01' AND '2008-06-30'
 GROUP BY DATE;

Keeping History Tables Clean

A clean history table is one that contains no unnecessary information. You normally do not want entries going into the history table if nothing relevant changed. So your history table mechanism should examine the columns it is tracking, and only make an entry to the history table if one of the columns of interest actually changed.

Problems With The Naive History Table

A very basic history table will usually copy the entire original row from the source table into the history table whenever an INSERT, UPDATE or DELETE occurs. One simple problem is that you end up with bloated history tables. Because they are cluttered with unnecessary repititions, they are difficult to work with by inspection.

A much more serious technical problem with the naive approach is that it is horribly difficult to produce the queries demonstrated above. You must reproduce the concept of a delta by either running through all of the rows on the client, or you must make a difficult (and often impossible) JOIN of the history table to itself in which you connect each row to the row that came just before it. All I can say is, no thanks, I'll go with the delta.

History Table Security

History tables always involve some concept of auditing, that is, keeping track of user actions. This means we need to protect against deliberate falsification of the history tables, which leads to two rules. First, a user must have no ability to directly DELETE rows from the history table, or they could erase the record of changes. Second, the user must have no ability to directly INSERT or UPDATE existing rows, because if they could they can falsify the history. These rules apply to both regular users and system administrators, the administrator must have no privelege to subvert or manipulate the history.

Since history tables have a tendency to become seriously bloated, there must be some priveleged group that can DELETE from the history tables, which they would do as a periodic purge operation. This group should have no ability to UPDATE the tables, because such priveleges would open a potential hole for subverting the history. Regular system administrators should not be in this group, this should be a special group whose only purpose is to DELETE out of the history tables.

If you are making use of DELTA columns, then stricly speaking you do not want to purge, but compress history tables. If you want to purge out all entries in 2005, you must replace them with a single entry that contains a SUM of the DELTA columns for all of 2005.

So to sum up, we have the following security rules for a history table:

  • No system user should be able to DELETE from the history table.
  • No system user should be able to UPDATE the history table.
  • No system user should be able to directly control the INSERT into the history table.
  • A special group must be defined whose only ability is to DELETE from the history table, so that the tables can be purged (or compressed) from time to time.

Implementation

As always, you have your choice of implementing the history mechanism in the client code or in the database itself.

The best performing and most secure method is to implement history tables with triggers on the source table. This is the best way to implement both security and the actual business rules in one encapsulated object (the table). However, if you have no current practices for coding server-side routines, or you do not have a data dictionary system that will generate the code for you, then it may not be practical to go server-side for a single feature.

Implementing history tables in code has the usual benefit of keeping you in the language and habits you are most familiar with, but it means that you cannot allow access to your database except through your application. I cannot of course make a general rule here, this decision is best made by the design team based on the situation at hand and anticipated future needs.

Conclusion

History tables have many uses. Beyond the obvious first use of finding indidivual values at some point in the past, well crafted tables can produce company-wide aggregations like total open balances on a given day, changes in booked orders on a day or in a range of days, and many other queries along those lines. Security is very important to prevent history tables from being subverted.

NEXT ESSAY: Different Foreign Keys For Different Tables

Monday, July 14, 2008

The Wonderful Awful Browser

When a desktop programmer tries to write database applications for the browser, he faces a great many challenges, both technical and cultural. Both sets of challenges appear because the browser and the web were invented for purposes different than our own. On the technical side we must reinvent huge amounts of functionality that we got "for free" with the old desktop systems of Foxpro, Delphi, VB Classic and so on, and on the cultural side we must wade through mountains of irrelevant or downright damaging advice that is aimed at people working on the next version of Facebook or eBay. In this essay we look at as many of these challenges as I can muster.

Why A Desktop Developer Would Move to The Web

When the browser first appeared, it totally lacked the technical powers required to replace desktop applications. Nevertheless, some programmers immediately began to ponder how to move into the world of the browser. The reasons were simple then, are simple now, and have not changed:

  • Far easier deployment -- nothing to install.
  • Worldwide access -- businesses with multiple locations are suddenly much easier to take care of.
  • You could now create a public website and give customers and vendors limited access to certain information.
  • Operating System independence. This is far more of a reality now than we dreamed it might be in the darkest days of Microsoft's Total World Domination, but there were visionaries early on who saw the possibilities.

So there are many programmers, and I am one of them, who continue to work on the same kinds of applications we did before the web existed, but who now deploy these applications in the browser, for the reasons listed above. Here now is our tale of woe and sorrow!

The Cultural Divide

While desktop programmers were scratching their heads and trying to figure out how to fit into this new world, a new generation of programmers was growing up who were perfecting this new platform and developing applications that were undreamed of before. Unfortunately, some of the good advice they dreamed up is either irrelevant or counter-productive to the database programmer who is deploying to the web.

The driving reality for the database application programmer is that her users are not surfing. They are using a dedicated program written for the purposes of the business they work for. Most of what the browser can do is either not necessary or positively in their way, and the browser lacks productivity tools that they took for granted in "the old system." This fact is central to the cultural divide between application programmers and web programmers.

The Infamous Back Button Problem

If I surf over to www.osnews.com and click on an article, when I am finished I click "BACK" once or twice until I'm back to osnews, and then pick another article. But to the application user, who is not surfing but is using a dedicated program, who has clicked "New Patient", typed in the info, and clicked "Save", the back button is a positive menace. It is misleading and dangerous. This has led to who-knows-how-many rants from web programmers telling application programmers, "You don't understand the web, you shouldn't write it that way," in which the desparate application programmer replies only, "but you don't understand, I must have it work this way." The simple fact is that when a user is modifying data in a browser there is no concept of BACK. There may "UNDO" or "REVERT", but once the data is saved it is saved. This is why application programmers resort to trying to hide or disable the button, or why they think they should be able to modify the history (which of course they cannot do because that would be a huge security hole for public sites).

Ajax only Makes The Back Button Worse

Picture a user on the customers screen, who then goes to the menu and picks the vendors screen. They work for five minutes on the vendors screen, and their wonderful snappy AJAX application is fetching search results and navigating from row to row and saving changes. Then they decide they made a mistake and hit [BACK] and wham! they're on the customer screen! It seems that the better the applications become, the worse the [BACK] button becomes. In my own shop we have finally decided to have the login program pop up a new window which does not have the [BACK] button or the address bar. This is considered heresy by web programmers (you don't understand the web! they cry) but of course what is true for them is not true for us, and vice-versa.

This also leads to much work. We must provide for such features as UNDO with no native support in the browser, and worse, with whatever native support the browser does have been intended for something totally different.

Your Application Must Work Without Javascript

This is dead wrong for the application programmer. Application programmers have a power that is totally outside the experience of a pure web programmer: we can dictate system requirements to the customer. This led to many unhappy problems before the web, but with Firefox (and firebug!) we now have a platform that is free and robust. We simply install Firefox (or instruct the IT department to do so) and we have a platform that we know will support our application.

Keyboard Shortcuts

Nothing illustrates the divide between the web and the desktop like keyboard shortcuts.

When Windows 95 swept the office world (but before the web really came into its own), programmers developed a new term for applications that required constant use of the mouse: we called them "mousetraps". The worst kind of mousetrap program requires the user to constantly lift their hand from the keyboard and go to the mouse, then back again. This is fatiguing, confusing, and terribly counter-productive for the end user.

But the real problem is that the browser was born a mousetrap. From the perspective of the desktop programmer, keyboard shortcuts are clearly an afterthought, a "red-headed stepchild" as they say. Native HTML supports only the ACCESSKEY attribute, and recently Firefox was changed so that the default key combination is CTRL+ALT instead of ALT. This small change led me to finally realize that these folks, to put it mildly, have never lived in my world and haven't the slightest clue what my users need. I could expect no help from them on this front.

The solution for the web programmer is to remember my users are not surfing, they are using a dedicated program. Therefore it is the Right Thing for the application programmer to hijack the CTRL-N key and have it mean "New Patient" (or New Customer, New Vendor, etc) instead of opening a new browser window. Moroever, he must kill the CTRL-N so that it does nothing on a page where there is no [New] button. If he does not, then sometimes CTRL-N will create a new patient, and sometimes it will pop up a new window with my.yahoo.com! So the application programmer confidently rewires the "standard" browser keys and has happier customers for his effort.

Technical Problems In the Browser

No Default Focus

Have you ever gone to a website where the first thing you must do is log in, but the user id input does not have focus? That is a sure sign that the page was written by a web programmer with no desktop experience. When you put a database program into the browser, you expect the user to be typing constantly, looking things up, adding information, and so forth. So the application programmer must ensure that his first control always receives focus. Call it petty if you like, but without it your program becomes a mousetrap. Perfection comes by concentrating on these small things that either annoy or please users.

Tabbing Off to Mars

Default browser behavior is to allow the user to TAB through controls in the order they were created. This can be modified by explicitly assigning TABINDEX attributes to the control. However, when you get to the last control, the browser then Tabs you up to the menu, or the address bar, or anywhere else.

In a business application, where the user is not surfing the web, this is wrong. Tabbing out of the content area is equilivant to exiting the application, it throws the user into a context that they do not need and (sad to say) do not understand.

When I first began deploying business apps in the browser I would get calls saying, "it's frozen" or "i'm typing and nothings happening" and other such mysterious claims. Once I observed the users I realized they were "tabbing to Mars", the focus was up on the menu or in the address bar or somewhere else equally irrelevant. So we created the idea of the "Tab Loop", so that when the user hits TAB on the last control it loops back to the first. This completely ended those calls.

The Tower Of Babel

Desktop programmers have a luxury undreamed of by web programmers: they can do all or nearly all of their programming in a single language, like classic Visual Basic, Foxpro, or Delphi (or heck, COBOL or 4GL!). Most of these programmers also know SQL, but it is not seen as a burden to learn it, it is just part of the job.

But when the application programmer moves to the web, he is confronted with at least four systems he must grasp if he is to perform as a master crafstman: (X)HTML, CSS, Javascript, and one or more server languages like Ruby, PHP, Java, etc. These different technologies all have syntaxes and philosophies that are different from each other and from past experience. All I can say is I'm glad I made the effort but I sure as heck hope I never have to make a change that dramatic again.

Let's Not Talk About State

When an application programmer moves to the web, he is confronted with the totally alien concept that he cannot maintain state. This idea has been discussed much in past years, and I suspect it may not be the problem it once was, as most of us have long since gotten past it. I did not want to leave it out completely, but it is way too large an issue to discuss in a paragraph and I doubt anyway that I could add to the wisdom that is already out there.

Lousy Widgets

The HTML SELECT element stinks. Every serious application programmer either downloads a replacement or writes his own replacement. In the old world of the desktop we did not have to do such things.

The final piece of the puzzle in my shop was jQuery. The irony of jQuery is that it seems to me its core function is DOM traversal and manipulation, but its elegant simplicity has drawn creative minds to do things like create really nice widgets for entering time. In my shop we finished off our desktop-in-browser framework by using jQuery extensively, this solved lots of our lousy widgets problem.

Salvation In the New Javascript Frameworks?

I should mention the new frameworks that are emerging for desktop-in-browser, such as extJS and Dojo, not to mention of course the Yahoo! User! Interface! Library!.

In my case I set out on the task of doing browser-based applications four years ago, when none of these technologies existed. So I developed my own simple browser-side framework. jQuery let me round off the rough edges, and now I have no need of a third-party framework. So I am afraid I cannot offer any experience in the use of these others.

But even so, if I could use somebody else's dedicated work and put my efforts elsewhere, that would only be wise, so I continue to watch them closely.

When Do Application Programmers Accept Advice From Web Programmers?

Does all of this mean that we application programmers can learn nothing from web programmers? Of course not, that we be arrogant in the extreme. The answer is that when we enter the world of the public website, we must seek the advice and guidance of the experts in that world. So when I write the public portion of the application, the part that is visible to customers, vendors and other trading partners, I have to follow the standards of common practice: the back button has to work, Javascript should be optional, it should work on IE (arg), and so on and so forth.

Conclusion: When Worlds Collide

The desktop application programmer who decides to deploy business applications in the browser will face two broad challenges: techical and culturual, both of which stem from the origins of the web, which are so different from the origins of the desktop. The technical challenges tend to center around features of the browser that are either lacking or downright counter-productive, and this is made worse because the advice a programmer will receive comes from a culture whose goals are very different from his own. The database programmer who wants to deploy applications in the browser must be prepared to reproduce a lot of features we took for granted in the desktop, and he must also be prepared to filter through the received wisdom and throw out anything that does not meet the needs of his end-users.

Addendum: After reading a few comments on ycombinator.com I should probably stress that the job of getting the application into the browser is completely doable (in fact I've done it myself). The browser can now easily handle the job of desktop applications. The OP lists the hazards I and many other have faced and overcome in getting there. A big part of the conclusion is that it is a lot easier to get done if you recognize why it seems so hard: which is that you may be getting advice from people whose goals are very different form yours.

Next Essay: History Tables

Sunday, July 6, 2008

Database Performance: Pay Me Now or Pay Me Later

Many database performance decisions come down to "pay me now or pay me later." Some decisions will produce faster inserts and updates at the cost of slower and more complex reads, while other decisions will slow down inserts and updates but provide faster and easier reads.

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Paying With Indexes

An index speeds up SELECT operations. We are not going to go into any detail about how indexes work, this week we will stick to how indexes affect performance.

Imagine a table of 10,000 sales orders. You wish to pull out a handful of fields for orders placed on 5/1/08, so you issue this SELECT:

SELECT customer,order_total 
  FROM orders
 WHERE date = '2008-05-01'

If you do not have an index on the table, then the database server will have to scan every single row in the table to find the rows that match the WHERE clause. On the other hand, if you had an index on the date column, the server would first read the index to find pointers to the correct rows, and then read only the rows you needed. The index itself is optimized by various methods so that only a very few reads are necessary to find the correct values. Most databases support the following syntax:

CREATE INDEX some_unique_name ON orders (date)

As far as performance goes, an index will slow down write operations (INSERT, UPDATE, and DELETE) because the index must be updated when the write operation occurs. This cost is on top of the write operation itself. (Addendum added July 7: Jochen points out correctly in his comment that this statement oversimplifies things. While it is true that the index must be updated for writes, the index can also dramatically speed up UPDATE and DELETE operations if those operations use a WHERE clause that can benefit from the index.)

In terms of "pay me now or pay me later", when you regularly add a lot of indexes you are opting to "pay me now." You pay the price of slower writes to get faster reads. If you regularly avoid adding any indexes you are opting to "pay me later." You defer the costs of access to read time to get faster writes.

I should note that it is not possible to completely avoid indexes, nor is there any value in trying to. For instance, a primary key requires an index because otherwise you have to scan the entire table every time you do an INSERT, which is just plain crazy. Foreign keys benefit from indexes as well for similar reasons.

Paying With Views

A "view" is a stored SQL statement that you can SELECT from as if were a table. Imagine we have a table of TEACHERS and a table of COURSES that they are teaching in a particular year. We often need to display a list of courses with the names of the teachers. We can do this with a JOIN, but a view gives us an easier pre-defined way to do this:

CREATE VIEW courses_teachers AS
SELECT courses.room,courses.period,courses.teacher
      ,courses.year
      ,teacher.first_name,teacher.last_name
  FROM courses 
  JOIN teachers ON courses.teacher = teachers.teacher

...which now lets you do the easier SELECT:

SELECT * FROM courses_teachers WHERE year='2008';

In terms of "pay me now or pay me later" a view is always a "pay me later" decision. It makes for easier coding but the server must go out on every SELECT and gather together the data required.

The "pay me later" nature of a VIEW meets its greatest extreme when the view contains aggregations. Consider the following view which gives you easy access to customers and their lifetime history of orders and payments:

CREATE VIEW customers_extended AS
SELECT customers.*
      ,SUM(orders.order_total) as orders_total
      ,SUM(invoices.balance)   as balance
  FROM customers
  JOIN orders    ON customers.customer = orders.customer
  JOIN invoices  ON customers.customer = invoices.customer
  
-- Pulling from the VIEW requires a complete read
-- of relevant ORDERS and INVOICES tables
SELECT * FROM customers_extended
 WHERE customer = X;

This view is a "pay me later" proposition because every time you issue a SELECT from the view, it will have to scan many rows from the ORDERS and INVOICES tables. The contrasting method is to denormalize which is a "pay me now" approach.

Paying With Denormalization

Denormalizing means taking a normalized database and deliberately inserting redundant values. I have an essay on the three Denormalization Patterns that I use myself, which follow these three forms:

  • FETCH operations, where a value such as an item's price is copied from the ITEMS table into the ORDER_LINES table.
  • EXTEND operations, where you take the QTY and the PRICE columns in the ORDER_LINES table and write the EXTENDED_PRICE.
  • AGGREGATE operations, such as writing the total of ORDER_LINES onto the ORDERS table.

All of these operations fall into the "pay me now" category. When these denormalized columns are put into tables, they add to the the size of the table and increase the cost of write operations. However, when it comes time to SELECT out of the tables the values are all there ready to go, usually with fewer JOINs and lower overall disk activity.

Extreme Pay Me Now

In my line of work I deal with line-of-business programs that are commissioned or purchased by businesses to do their daily work. User counts are low and resources are high, because often I will have 10 users on a single server, with access via internet limited to only a few thousand potential customers of which very few are ever on at the same time.

In this context, I prefer to take the "pay me now" approach to its fullest realization. This means I tend to design my systems so that:

  • Any column a user is likely to filter on has an index.
  • Tables are fully denormalized, containing a wealth of derived values.

This means that all write operations on my systems are slower than they might otherwise be. However, this is more than acceptable within this context because the server is largely untaxed, and users do not notice the difference between 100ms and 200ms to save a row. So I can pay when the user does not notice and as reward I have very rich reporting and lookup abilities.

The extreme pay-me-now approach has one more advantage. The wealth of derived values in the database lets end-users find what they are looking for without calling a programmer and asking for a special page or report. Generally the more derived values there are the truer this becomes.

Extreme Pay Me Later

The extreme form of pay-me-later is a fully normalized database with no derived values and a minimum of indexes. Calculated values are available either in views, client-side code or both. This type of database is tuned for lots of fast writes because the cost of an INSERT or UPDATE has been kept to an absolute minimum. The database will be slower to perform ad-hoc or one-off queries because the server will have to do table scans whenever a user filters on anything except primary keys and foreign keys.

The lack of derived values in fully normalized databases also leads to more phone calls and emails asking the programmer to create a report or page that will work out derived values that are not present in the database.

Conclusion: Know Your Context

This week we have taken common database technologies such as indexes and views and seen how they affect performance. All of these technologies can be judged in terms of the "pay me now or pay me later" decision.

Database programmers normally choose to "pay me later" when they must support a large number of simultaneous write operations with a minimum of contention. These situations call for fewer indexes and strict normalization. The trade-off is that ad-hoc or one-off queries will involve more JOINs, more table scans and an increased likelihood the programmer will be called in for special cases.

When read operations are more common than writes, or when inquiries and reports are likely to be unpredictable, database programmers will choose to "pay me now" by doing more work on the write operation. There will be more indexes and more denormalized values, so that the user is more likely to quickly locate whatever they want without programmer intervention.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The normalization essays on this blog are: