Sunday, April 27, 2008

Denormalization Patterns

Welcome to the Database Programmer! The main theme of these essays is that your applications will be leaner, faster, and easier to write and maintain if you understand how databases work.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Non-normalized Database

A denormalized database is one that has been meticulously normalized to eliminate redundancies, only to have redundancies deliberately put back in to meet other needs. That is the kind of database we are going to talk about today.

By contrast, a database is "non-normalized" if nobody ever bothered to normalize it. This is what you get when the programmer says, "I'm not concerned with the table structure yet, I'm working on the code first." In my experience this is something like saying, "I'm not worried about the plan for this work bench I'm building, I'm just going to throw the wood on the table saw and start cutting." We are not interested in non-normalized databases.

The Fully Normalized Database

The fully normalized database will have few if any redundancies; each fact will be stored in exactly one place. This database has one big advantage, which is that write actions (INSERT, UPDATE, and DELETE) will be very easy to code up correctly. The application code for a fully normalized database will be smooth, simple, easy to write and easy to maintain.

But decades of experience have shown that fully normalized databases have a few drawbacks, which is why practical-minded programmers always end up denormalizing their designs. The problems in particular are:

  • No calculated values. If you have a shopping cart with the columns "price" and "qty", it may seem natural to put in a column "extended_price" that holds price * qty. This is actually forbidden by third normal form, though almost everybody puts it in anyway.
  • Non-reproducible Calculations. Continuing from the point above, most shopping carts or ordering systems have many calculations that are far more complicated than simply doing price * qty. Often a calculation may depend on a dozen or more preceding calculations. In complex situations like this, it is all too common that changes in the program mean an invoice printed after an upgrade produces different numbers than the same invoice printed before the upgrade. You don't want that phone call!
  • JOIN Jungles. A fully normalized database will have information scattered in many different tables. Though this makes it easy to get it right going in, it can make it terribly difficult to get seemingly simple combinations of data back out.

The Denormalized Database

As I said above, the denormalized database is one that was first normalized by having its redundancies removed, only to have some redundancies deliberately put back in. This can solve all three of the problems listed above. There is a cost of course. When we denormalize we have to keep two things in mind:

  1. We must dream up a way to keep the redundant values correct. If we can pull this off we get all of the advantages of denormalization with no drawbacks.
  2. Following up on the first point, we have a better chance of getting it right if we can identify a set of denormalization patterns. Once they are identified, we can code up something in the framework that supports them, and now we can see the gold at the end of the rainbow.

The Foreign Key Is Our Friend (Again)

I have said many times in these essays that the foreign key is the only way to establish relationships between facts stored in different tables. We will now see how this relates to denormalizing our database.

Denormalization means introducing redundancies. In other words, a fact that was stored in only one place is now stored in two or more places. If we are going to copy a value from one table to another table, it stands to reason that there must be some logical relationship between those two tables. Since the only kind of relationship we can have between two tables is a foreign key, our denormalization patterns must in some way work with foreign keys.

The First Pattern: FETCH

Consider a shopping cart that has a column "sku" and another column "price." Most programmers lay out these tables and write some code that copies the price from the ITEMS table to the ORDER_LINES table. I call this pattern the "FETCH" because the price is FETCHed from the ITEMS table and written into the ORDER_LINES table.

Most programmers code up FETCH operations all over the place and do not ever realize they are denormalizing. I think this pattern is just so natural that most of us never think about it. If you examine your database applications you will likely see that you are doing this all over the place.

In order to get this pattern to operate correctly, your framework must make sure, at the very least, that the SKU is not null when an INSERT is made to ORDER_LINES, and that the price is copied during the INSERT. You can maintain correctness by not allowing users to change the SKU on this table; if they change their minds they must delete a line and enter a new one. Or, you can make your framework a little more flexible and execute the FETCH again if the SKU changes on an UPDATE.
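
If your framework enforces this rule with database triggers, the FETCH can be sketched like so. This is a minimal, hypothetical schema (the table names follow the essay), run through SQLite via Python's sqlite3 module:

```python
import sqlite3

# A sketch of the FETCH rule as a trigger (hypothetical minimal schema):
# on INSERT the current price is copied from ITEMS into the new ORDER_LINES row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE items (sku TEXT PRIMARY KEY, price NUMERIC NOT NULL);
CREATE TABLE order_lines (
    order_id INTEGER,
    sku      TEXT NOT NULL REFERENCES items(sku),
    qty      INTEGER,
    price    NUMERIC
);

CREATE TRIGGER order_lines_fetch AFTER INSERT ON order_lines
BEGIN
    UPDATE order_lines
       SET price = (SELECT price FROM items WHERE sku = NEW.sku)
     WHERE rowid = NEW.rowid;
END;
""")

con.execute("INSERT INTO items VALUES ('WIDGET', 9.99)")
con.execute("INSERT INTO order_lines (order_id, sku, qty) VALUES (1, 'WIDGET', 3)")

# A later change to the catalog price does not disturb the line already taken.
con.execute("UPDATE items SET price = 12.50 WHERE sku = 'WIDGET'")
line_price = con.execute("SELECT price FROM order_lines").fetchone()[0]
print(line_price)  # 9.99
```

The same trigger body, re-run from an AFTER UPDATE OF sku trigger, gives you the more flexible behavior described above.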

Sidebar: Is FETCH Really Denormalizing?

Die-hard relational theorists will tell you not to copy price from the items table. You are supposed to leave it where it belongs and use JOINs to pick up the price when it is needed. There are three arguments against this sort of purity.

The first practical argument is that it is horribly difficult to deal with complex calculations this way. It is far easier to copy the price when the line goes in, so you never have to "go looking" for it again.

The second practical argument is that performance tanks if you follow the die-hard relational approach. If you have to look in 6 tables every time somebody refreshes their cart you will have a much slower program than one that only has to look in one table.

But the third argument is more theoretical, and it is this: the FETCH is not really denormalizing. The idea is that when the customer makes an order your store has entered into an agreement to sell something at a particular price. That price is stored on the order and is now a fact about that order. It is not redundant because you are not storing the SKU's generic price, you are only storing the price that this particular customer is going to pay on this order. If the price changes 5 minutes after the customer places the order, they will expect to get the price as it was when they put the item in the cart, and so you are actually doing the right thing by writing it to the order.

The Second Pattern: Aggregations

The FETCH that was described above is all about copying a value from a parent table to a child table. The opposite pattern occurs when you roll up values from children to parents. These are usually done as totals (SUMS), counts, averages, minimums, and maximums.

Looking at the ORDER_LINES again, if a customer has 3 items in their cart, it is perfectly natural to most programmers to put a column "PRODUCT_TOTAL" onto their ORDERS table that holds the sum of all of the lines. This is called an aggregation, because the result in the parent table is always some operation performed on the aggregation of all of the child rows.

Aggregations are always denormalizing because they are values that could be derived from other values. To be specific, an aggregation violates third normal form because it introduces a non-key dependency: a value that depends not on the key but on values from a completely different table!

In order to make sure this value is always correct, the framework must always update the total on the parent table when any line in the child table changes. If your framework can do that successfully, your aggregations will always be correct.
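
One way a framework can enforce this is again with triggers that recompute the parent's total whenever a child row changes. The sketch below (hypothetical table names, SQLite via Python's sqlite3 module) shows the INSERT and DELETE cases; an UPDATE trigger would follow the same shape:

```python
import sqlite3

# Triggers recompute the redundant ORDERS.product_total whenever a child
# ORDER_LINES row is inserted or deleted.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product_total NUMERIC DEFAULT 0);
CREATE TABLE order_lines (order_id INTEGER REFERENCES orders(order_id),
                          extended_price NUMERIC);

CREATE TRIGGER lines_ins AFTER INSERT ON order_lines
BEGIN
    UPDATE orders
       SET product_total = (SELECT COALESCE(SUM(extended_price), 0)
                              FROM order_lines
                             WHERE order_id = NEW.order_id)
     WHERE order_id = NEW.order_id;
END;

CREATE TRIGGER lines_del AFTER DELETE ON order_lines
BEGIN
    UPDATE orders
       SET product_total = (SELECT COALESCE(SUM(extended_price), 0)
                              FROM order_lines
                             WHERE order_id = OLD.order_id)
     WHERE order_id = OLD.order_id;
END;
""")

con.execute("INSERT INTO orders (order_id) VALUES (1)")
con.execute("INSERT INTO order_lines VALUES (1, 10)")
con.execute("INSERT INTO order_lines VALUES (1, 25)")
total = con.execute("SELECT product_total FROM orders").fetchone()[0]

con.execute("DELETE FROM order_lines WHERE extended_price = 25")
total_after = con.execute("SELECT product_total FROM orders").fetchone()[0]
print(total, total_after)  # 35 10
```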

The Third Pattern: EXTEND

The first two patterns we saw dealt with foreign keys. The first pattern, the FETCH, involves values travelling "downward" on a foreign key from parent to child. The second pattern involves values travelling "upward" on a foreign key from child to parent. The third and final denormalizing pattern involves calculated values within a row.

The example at the beginning of this essay was the column EXTENDED_PRICE, which holds the value of PRICE * QTY. This is an EXTEND operation, because it extends a row by adding a new redundant value. It is denormalizing because it violates third normal form: it introduces a value that depends on other non-key columns rather than solely on a candidate key.

If you want to make sure your EXTENDs are always correct, you need a framework that will always update the calculation when either of its dependent values changes.

Dependency Tracking

In describing the three denormalizing patterns above, I have explained what you need to make sure each one is performed successfully. There is a final requirement to keeping all of this correct, which is that the operations must be performed in the proper order.

Considering the shopping cart again, in particular the ORDER_LINES table, these three operations must occur in this order:

  1. The PRICE is FETCHed
  2. The EXTENDED_PRICE is calculated as an EXTEND
  3. The ORDERS table's PRODUCT_TOTAL value is adjusted.

Your framework must have a reasonable way to make sure that the operations are performed in the correct order, or they will not give the correct result. As a rule of thumb, in most systems the FETCHes come first, followed by the EXTENDs, and then the aggregations.

Meta-data can be a big help here. When I first contemplated these patterns about four years ago, it occurred to me that they could all be stored as formulas in the basic description of the database, and that a code generator would sequence them for me and generate the code, so that the operations would always occur in the correct order. I wrote the basic system in the fall of 2004 and have found it to work extremely well ever since. In my personal opinion, this is the only way to reliably handle these patterns.
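
The sequencing step itself can be sketched in a few lines. This is a toy with hypothetical column names, not the actual generator: each denormalized column maps to the set of columns its formula reads, and Kahn's algorithm emits them in dependency order.

```python
# Each denormalized column maps to the set of columns its formula reads.
formulas = {
    "order_lines.price":          set(),                            # FETCH
    "order_lines.extended_price": {"order_lines.price",
                                   "order_lines.qty"},              # EXTEND
    "orders.product_total":       {"order_lines.extended_price"},   # aggregation
}

def sequence(deps):
    """Emit each column only after every column its formula reads."""
    deps = {col: set(d) for col, d in deps.items()}
    order = []
    while deps:
        # Ready = formulas with no unmet dependency among the remaining ones.
        ready = sorted(c for c, d in deps.items() if not (d & deps.keys()))
        if not ready:
            raise ValueError("circular dependency among formulas")
        for c in ready:
            order.append(c)
            del deps[c]
    return order

plan = sequence(formulas)
print(plan)  # the FETCH comes first, then the EXTEND, then the aggregation
```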

Conclusion: Denormalization Also Follows Patterns

A fully normalized database makes it easy to get data in correctly, but makes it difficult to get it out. Denormalizing is the process of taking a normalized database and deliberately introducing redundancies to improve query writing, performance and correctness. Not surprisingly, denormalization has its own patterns. Two of these follow the foreign key, and the third one works inside of a single row. If you follow these patterns and fashion your framework to keep them correct, you get all of the benefits of denormalization without the concern for bad data.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Sunday, April 20, 2008

Advanced Table Design: Resolutions

Welcome to the Database Programmer! Every Monday morning this blog contains a new essay. This blog is for people who want to learn the practical realities of databases. Topics range from simple to advanced.

The main theme of these essays is that your applications will be leaner, faster, and easier to write and maintain if you understand how databases work.

The complete table of contents is here.

The Resolution Pattern

A resolution is an interesting database pattern because it involves both table design and a very specific form of SELECT statement. Just getting the tables down is not enough; you have to know how to write the SELECT that will return the correct value.

Basic Description and Example

A resolution pattern occurs when you need a value and there is more than one place where it might be. As an example, consider the case of a computer services shop that provides complete IT services, including programming. In their billing system, they have a simple table that lists the rates for their various activities.

ACTIVITY  | RATE
----------+-------
ITGENERAL | 100
PROJMGT   | 200
SOFTWARE  | 150

This is simple enough, but now suppose that you have a particular employee that you bill out at $175.00/hour for software development. This makes the picture a little more complicated. But suppose that it gets more complicated, suppose that you enter into an arrangement with a particular customer to do volume software development for them for $135.00. And just to make it interesting, suppose you have a very specific arrangement with a particular customer to provide the services of a particular employee for $185.00 for project management.

With this many possible billing arrangements, your super-simple invoicing program is suddenly not so simple. On any particular invoice line, you must resolve the actual hourly billing rate out of several possibilities. Because you must resolve the value, this pattern is called a resolution.

Precise Description of the Resolution Pattern

A resolution pattern has these characteristics:

  • The goal of a resolution is to find a particular value. In our example this is a billing rate.
  • Resolutions examine multiple possible values and pick the first match according to precedence.
  • Precedence usually begins with the most specific and falls back to the most general. In our example the most specific possible rate is defined for a customer-activity-employee, while the most general is the default rate for an activity.

Resolutions are not always easy to recognize. Mostly this is because customers do not tell you "we have a resolution." Instead they tell you they have a billing rate. The explanation of the special overrides for employees comes in a different conversation, and perhaps to a different member of your team. Then later comes the explanation of the other overrides. The resolution only becomes apparent when the various requirements are all sorted out and put next to each other. Then somebody says, "Hey, there are four different formulas for the billing rate!" Then you know you have a resolution.

The Table Design

A resolution requires one table for each possible level of detail where a value might be supplied. In our example there will be a table for:

  • Rates by activity-customer-employee
  • Rates by activity-customer
  • Rates by activity-employee
  • Final default values by activity

These tables only contain values when they are relevant. The table of activity-customer rates does not contain a row for every possible combination of activity and customer; it only contains a row when there has been some agreement to provide an activity to a specific customer at a special rate.

Here are the tables:

ACTIVITY | CUSTOMER | RATE
---------+----------+------
PROJMGT  | PRAXIS   |  225
SOFTWARE | PRAXIS   |  235


ACTIVITY | EMPLOYEE | RATE
---------+----------+------
PROJMGT  | SRUSSEL  |  225


ACTIVITY | EMPLOYEE | CUSTOMER | RATE
---------+----------+----------+------
PROJMGT  | HIROKO   | PRAXIS   |  250

Resolving In Client Code Will Kill Performance

Now consider that there is a table somewhere that is used to drive billing. Maybe the employees themselves record their time in this table, or maybe some clerical staff member enters it. Whoever puts the rows in, each one has an activity, an employee, and a customer (and of course hours). You need to write a program that finds the correct billing rate for each row.

A die-hard code grinder will do all of this in the client. He will write a query to pull all of the rows from the time entry table. Then he will loop through these rows. For each line he will query the server for the most detailed value, activity-employee-customer. If it is not found he will do a second query for the next table in line, and so forth. This will be a performance disaster because his program will be making a huge number of round trips to the server. If he understood the LEFT JOIN he would need only one trip to the server.

First Stab with A LEFT JOIN

Here is a query that does most of what we need for the resolution:

SELECT ol.activity,ol.employee,ol.customer
      ,aec.rate as aec_rate
      ,ac.rate  as ac_rate
      ,ae.rate  as ae_rate
      ,a.rate
  FROM orderlines ol
  LEFT JOIN act_emp_cust_rates     aec 
    ON ol.activity = aec.activity
   AND ol.customer = aec.customer
   AND ol.employee = aec.employee
  LEFT JOIN act_cust_rates         ac
    ON ol.activity = ac.activity
   AND ol.customer = ac.customer
  LEFT JOIN act_emp_rates          ae 
    ON ol.activity = ae.activity
   AND ol.employee = ae.employee
  JOIN activities                  a 
    ON ol.activity = a.activity  
 WHERE (....relevant search conditions....)

The LEFT JOIN tells the server to return all matching rows from the orderlines table, even if there is no match in the various override tables. The above query will return something like this:

ACTIVITY | EMPLOYEE | CUSTOMER | AEC_RATE | AC_RATE | AE_RATE | RATE
---------+----------+----------+----------+---------+---------+------
PROJMGT  | HIROKO   | PRAXIS   |      250 |   225   |  null   |  200
PROJMGT  | NIRGAL   | PRAXIS   |     null |   225   |  null   |  200
SOFTWARE | SRUSSEL  | PRAXIS   |     null |   235   |  null   |  150
PROJMGT  | SRUSSEL  | GE       |     null |  null   |   225   |  200
PROJMGT  | SRUSSEL  | NASA     |     null |  null   |   225   |  200
SOFTWARE | HIROKO   | PRAXIS   |     null |   235   |  null   |  150
SOFTWARE | HIROKO   | GE       |     null |  null   |  null   |  150

The Final Form of the Query

The first form of the query returns all four possible rates, and the effect of a LEFT JOIN is to have a NULL value where there was no match on the right side.

We can do better than this and return the actual rate by using a COALESCE function. A COALESCE allows us to list two or more values, and the function returns the first one that is not null. This lets us return the actual resolved value from the server:

SELECT ol.activity,ol.employee,ol.customer
      ,COALESCE(aec.rate,ac.rate,ae.rate,a.rate) as rate
  FROM orderlines ol
  LEFT JOIN act_emp_cust_rates     aec 
    ON ol.activity = aec.activity
   AND ol.customer = aec.customer
   AND ol.employee = aec.employee
  LEFT JOIN act_cust_rates         ac
    ON ol.activity = ac.activity
   AND ol.customer = ac.customer
  LEFT JOIN act_emp_rates          ae 
    ON ol.activity = ae.activity
   AND ol.employee = ae.employee
  JOIN activities                  a 
    ON ol.activity = a.activity  
 WHERE (....relevant search conditions....)

...which gives us the complete answer:

ACTIVITY | EMPLOYEE | CUSTOMER | RATE 
---------+----------+----------+------
PROJMGT  | HIROKO   | PRAXIS   |  250 
PROJMGT  | NIRGAL   | PRAXIS   |  225 
SOFTWARE | SRUSSEL  | PRAXIS   |  235 
PROJMGT  | SRUSSEL  | GE       |  225 
PROJMGT  | SRUSSEL  | NASA     |  225 
SOFTWARE | HIROKO   | PRAXIS   |  235 
SOFTWARE | HIROKO   | GE       |  150 
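
As a sanity check, the whole resolution can be exercised end to end. The script below loads the essay's sample rates into SQLite through Python's sqlite3 module and runs the COALESCE query:

```python
import sqlite3

# Load the sample rate tables and time entries, then run the resolution query.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE activities         (activity TEXT, rate INT);
CREATE TABLE act_cust_rates     (activity TEXT, customer TEXT, rate INT);
CREATE TABLE act_emp_rates      (activity TEXT, employee TEXT, rate INT);
CREATE TABLE act_emp_cust_rates (activity TEXT, employee TEXT, customer TEXT, rate INT);
CREATE TABLE orderlines         (activity TEXT, employee TEXT, customer TEXT);
""")
con.executemany("INSERT INTO activities VALUES (?,?)",
                [("ITGENERAL", 100), ("PROJMGT", 200), ("SOFTWARE", 150)])
con.executemany("INSERT INTO act_cust_rates VALUES (?,?,?)",
                [("PROJMGT", "PRAXIS", 225), ("SOFTWARE", "PRAXIS", 235)])
con.execute("INSERT INTO act_emp_rates VALUES ('PROJMGT','SRUSSEL',225)")
con.execute("INSERT INTO act_emp_cust_rates VALUES ('PROJMGT','HIROKO','PRAXIS',250)")
con.executemany("INSERT INTO orderlines VALUES (?,?,?)",
                [("PROJMGT", "HIROKO", "PRAXIS"), ("PROJMGT", "NIRGAL", "PRAXIS"),
                 ("SOFTWARE", "SRUSSEL", "PRAXIS"), ("PROJMGT", "SRUSSEL", "GE"),
                 ("PROJMGT", "SRUSSEL", "NASA"), ("SOFTWARE", "HIROKO", "PRAXIS"),
                 ("SOFTWARE", "HIROKO", "GE")])

rows = con.execute("""
SELECT ol.activity, ol.employee, ol.customer
      ,COALESCE(aec.rate, ac.rate, ae.rate, a.rate) AS rate
  FROM orderlines ol
  LEFT JOIN act_emp_cust_rates aec ON ol.activity = aec.activity
                                  AND ol.customer = aec.customer
                                  AND ol.employee = aec.employee
  LEFT JOIN act_cust_rates     ac  ON ol.activity = ac.activity
                                  AND ol.customer = ac.customer
  LEFT JOIN act_emp_rates      ae  ON ol.activity = ae.activity
                                  AND ol.employee = ae.employee
  JOIN activities              a   ON ol.activity = a.activity
""").fetchall()
rates = {(act, emp, cust): rate for act, emp, cust, rate in rows}
print(rates[("PROJMGT", "HIROKO", "PRAXIS")])   # 250: the most specific row wins
print(rates[("SOFTWARE", "HIROKO", "GE")])      # 150: falls back to the default
```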

Conclusion: Tables and Queries Go Together

We have seen this week our first table design pattern that requires a certain form of query. This shows us as well that queries themselves will fall into patterns, and we will definitely see more of these patterns in future essays.

Monday, April 14, 2008

JOINS Part Two, The Many Forms of JOIN

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The Many Kinds of JOIN

When a programmer first decides to start learning SQL, the JOIN appears simple at first. But it can produce some unexpected results. Sometimes a JOIN brings back more rows than the novice expects, and sometimes fewer. This week we are going to concentrate on the various results that are produced by the JOIN.

A JOIN always operates on the same principle: it returns one row for each matching combination of rows from the two tables.

A Parent-Child JOIN Returns One Row Per Matching Child

Consider the following two simple tables:

CUSTOMER | ZIP               CUSTOMER | ORDER | DATE
---------+---------          ---------+-------+----------
 1234    | 11733              1234    |    57 | 3/1/08
 5283    | 77074              1234    |    78 | 3/15/08
                              1234    |    89 | 4/07/08
                              5283    |    23 | 2/13/08
                              5283    |    32 | 3/17/08

When you JOIN these two tables together, there are three combinations that match for customer 1234, and two combinations that match on 5283, so this query:

SELECT customers.zip
      ,orders.customer,orders.order,orders.date
  FROM customers
  JOIN orders    ON customers.customer = orders.customer

...will return 5 rows. Values from the child table appear once, and values from the parent table repeat. The return count matches the number of rows in the child table:

 ZIP  | CUSTOMER | ORDER | DATE
------+----------+-------+----------
11733 |  1234    |    57 | 3/1/08
11733 |  1234    |    78 | 3/15/08
11733 |  1234    |    89 | 4/07/08
77074 |  5283    |    23 | 2/13/08
77074 |  5283    |    32 | 3/17/08

The Parent-Child And a Left JOIN

Now we consider the case when our Sales Manager comes down the hall and asks for a list of customers in zip code 11101, with their total sales. So you have this information in the database:

CUSTOMER | ZIP               CUSTOMER | ORDER | DATE    | AMOUNT
---------+---------          ---------+-------+---------+-------
 1234    | 11101              1234    |    57 | 3/1/08  |   25
 5283    | 11101              1234    |    78 | 3/15/08 |   35
 2938    | 11101              1234    |    89 | 4/07/08 |   45
                              5283    |    23 | 2/13/08 |   55
                              5283    |    32 | 3/17/08 |   65

...and you write this query:

SELECT c.customer,sum(o.amount) as amount
  FROM customers c
  JOIN orders    o ON c.customer = o.customer
 WHERE c.zip = 11101
 GROUP BY c.customer

...and the result only has two rows:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120

The novice database programmer will look at this and say, "I don't get it, my filter says zip 11101, but I only got two rows back." This is because the JOIN acted as a second filter: an inner JOIN only returns rows where there is a match on both sides, and customer 2938 has no orders.

If you want to get all of the rows from the left side of the JOIN, even if they have no match on the right side, you use a LEFT JOIN:

SELECT c.customer,sum(o.amount) as amount
  FROM customers c
  LEFT JOIN orders o ON c.customer = o.customer
 WHERE c.zip = 11101
 GROUP BY c.customer

...and now the answer is:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120
 2938    |  null

Of course we probably do not want that "null" because it messes up calculations, so you can use the COALESCE() function to replace nulls with zeroes. The COALESCE() function accepts a list of values and returns the first non-null:

SELECT c.customer,sum(COALESCE(o.amount,0)) as amount
  FROM customers c
  LEFT JOIN orders  o ON c.customer = o.customer
 WHERE c.zip = 11101
 GROUP BY c.customer

...and this completes the picture:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120
 2938    |     0
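
The query can be verified against the sample data in a few lines (SQLite via Python's sqlite3 module; the ORDER column is renamed "ord" here because ORDER is a reserved word):

```python
import sqlite3

# The LEFT JOIN plus COALESCE query, run against the essay's sample data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer INT, zip INT)")
con.execute("CREATE TABLE orders (customer INT, ord INT, date TEXT, amount INT)")
con.executemany("INSERT INTO customers VALUES (?,?)",
                [(1234, 11101), (5283, 11101), (2938, 11101)])
con.executemany("INSERT INTO orders VALUES (?,?,?,?)",
                [(1234, 57, "3/1/08", 25), (1234, 78, "3/15/08", 35),
                 (1234, 89, "4/07/08", 45), (5283, 23, "2/13/08", 55),
                 (5283, 32, "3/17/08", 65)])

totals = dict(con.execute("""
SELECT c.customer, SUM(COALESCE(o.amount, 0)) AS amount
  FROM customers c
  LEFT JOIN orders o ON c.customer = o.customer
 WHERE c.zip = 11101
 GROUP BY c.customer
"""))
print(totals)  # totals == {1234: 105, 2938: 0, 5283: 120}
```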

You Can JOIN a Table to Itself

Recently a customer of mine dropped by with a problem he had in an insurance database. He had a table that listed various companies and their total premiums collected by year. He is required by various regulations to report 28% of the premiums as income in the year they are collected, with the remaining 72% reported in the following year. It was not readily obvious how to do this.

This is a case of self-JOIN, where you JOIN a table to itself to match rows to each other. In this case we created the following query:

SELECT tab1.year
      ,tab1.premium * .28 as premium1
      ,tab2.premium * .72 as premium2
  FROM premiums tab1
  JOIN premiums tab2 ON tab1.company = tab2.company
 WHERE tab2.year = tab1.year - 1

Notice that the JOIN condition matches company to company, while the one-year offset is applied in the WHERE clause, so that each row pairs a year's premiums (28% recognized now) with the prior year's premiums (72% recognized now). The offset could equally be written as a second condition in the JOIN's ON clause; keeping the equality match in the ON and the offset in the WHERE simply makes the two roles easier to read.

Another SELF JOIN and a FULL OUTER JOIN

Another customer of mine called recently to complain that his legacy system is not doing something right. He said there was information missing.

In his business, which is magazine distribution to retailers, he has a table called "DEFAULTS" that lists the default number of each magazine given to each store. Sometimes he wants to see the defaults for two magazines side by side, for all of the stores they are delivered to. If his defaults table looked like this:

MAGAZINE | STORE | DEFAULT
---------+-------+--------
123      | A     |   5
123      | B     |  10
123      | C     |   7
456      | A     |   6
456      | D     |   3

...and he wanted to see magazines 123 and 456 side by side, he would expect output like this:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6
B     |   10     |    0
C     |    7     |    0
D     |    0     |    3

Here is how you can recognize this request as needing a self-join. If it appears that you need two separate filters for the same table, and the two filters result in values that are meant to sit side-by-side, then you have a self-join. Our first stab at the query would be:

SELECT d1.store
      ,d1.default as mag123
      ,d2.default as mag456
  FROM defaults d1
  JOIN defaults d2 on d1.store = d2.store
 WHERE d1.magazine = 123
   AND d2.magazine = 456

This is the query I found in his program, and as soon as I saw it his complaint made sense. He said basically, "There are only 5 rows for magazines x and y and there should be 500." The problem goes back to the fact that a JOIN limits the answer to those cases where there is a match on both sides. The query above would give only this:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6

...because only store A has a row for both magazines. What we want is a row for every store that carries either magazine, even if there is no matching entry for the other magazine. The LEFT JOIN we saw above would help, but would only do half the job; it would give us:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6
B     |   10     |    0
C     |    7     |    0

What we need is a FULL OUTER JOIN, which is basically a LEFT JOIN and a RIGHT JOIN combined: a LEFT JOIN that goes both ways. There is one catch: the magazine filters must be applied before the outer join happens (below they become subqueries), because a plain WHERE clause on d1 or d2 would throw away the very unmatched rows we are trying to keep. If we add in our COALESCE() functions the final query is:

SELECT COALESCE(d1.store,d2.store) as store
      ,COALESCE(d1.default,0) as mag123
      ,COALESCE(d2.default,0) as mag456
  FROM (SELECT * FROM defaults WHERE magazine = 123) d1
  FULL OUTER JOIN 
       (SELECT * FROM defaults WHERE magazine = 456) d2
    ON d1.store = d2.store
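
One caveat: some servers (MySQL, and SQLite before version 3.39) do not support FULL OUTER JOIN at all. On those platforms the same result can be emulated with a LEFT JOIN in each direction glued together by UNION. Here is a sketch, run through Python's sqlite3 module; the "default" column is quoted because DEFAULT is a reserved word:

```python
import sqlite3

# Emulating a FULL OUTER JOIN: LEFT JOIN in each direction, merged by UNION
# (which also removes the duplicate row for stores matched on both sides).
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE defaults (magazine INTEGER, store TEXT, "default" INTEGER)')
con.executemany("INSERT INTO defaults VALUES (?,?,?)",
                [(123, "A", 5), (123, "B", 10), (123, "C", 7),
                 (456, "A", 6), (456, "D", 3)])

rows = con.execute("""
SELECT d1.store, COALESCE(d1."default",0), COALESCE(d2."default",0)
  FROM (SELECT * FROM defaults WHERE magazine = 123) d1
  LEFT JOIN (SELECT * FROM defaults WHERE magazine = 456) d2 ON d1.store = d2.store
UNION
SELECT d2.store, COALESCE(d1."default",0), COALESCE(d2."default",0)
  FROM (SELECT * FROM defaults WHERE magazine = 456) d2
  LEFT JOIN (SELECT * FROM defaults WHERE magazine = 123) d1 ON d1.store = d2.store
 ORDER BY 1
""").fetchall()
print(rows)  # [('A', 5, 6), ('B', 10, 0), ('C', 7, 0), ('D', 0, 3)]
```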

Conclusion: That your JOIN May Be Complete

The JOIN is a very powerful clause. The JOIN is the most-used method for combining information from multiple tables together. This week we saw that the JOIN has provisions for joining a table to itself, and for handling cases where the matches may not be all present.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are:

Sunday, April 6, 2008

GROUP BY, HAVING, SUM, AVG, and COUNT(*)

FACTOID: This is the 2nd most popular page on this blog.

There are links to other essays at the bottom of this post.

UPDATED Friday Nov 19, 2010

Aggregation

You can use a SQL SELECT to aggregate data. Aggregation combines rows together and performs some operation on their combined values. Very common aggregations are COUNT, SUM, and AVG.

The simplest use of aggregations is to examine an entire table and pull out only the aggregations, with no other columns specified. Consider this SQL:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sumSales
      ,AVG(sale_amount) as avgSales
  FROM orders

If you have a very small sales order table, say about 7 rows, like this:

ORDER |  DATE      | STATE | SALE_AMOUNT
------+------------+-------+-------------
 1234 | 2007-11-01 | NY    |       10.00
 1235 | 2007-12-01 | TX    |       15.00
 1236 | 2008-01-01 | CA    |       20.00
 1237 | 2008-02-01 | TX    |       25.00
 1238 | 2008-03-01 | CA    |       30.00
 1239 | 2008-04-01 | NY    |       35.00
 1240 | 2008-05-01 | NY    |       40.00

Then the simple query above produces a one-row output:

CNT  | SUM  | AVG
-----+------+-----
  7  | 175  |  25

Some Notes on The Syntax

When we use COUNT(*) the asterisk goes inside the parentheses; COUNT(*) counts rows, rather than the non-null values of any particular column.

Note that the example names the output columns by saying "as sumSales" and "as avgSales". This is important because without it we will get whatever name the database server decides to use, which varies from platform to platform, so it is a good idea to get in the habit of using the "AS" clause.

The WHERE Clause Filters BEFORE the Aggregation

If you want to get just the sales from New York state, you can put a WHERE clause in:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sumSales
      ,AVG(sale_amount) as avgSales
  FROM orders
 WHERE state = 'NY'

...and you will get only the results for NY:

CNT | SUM  | AVG
----+------+----------
  3 |  85  |  28.33333

Notice of course that the average has a repeating decimal. Most databases have a ROUND function of some sort, so I can correct that with:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),2) as avg
  FROM orders
 WHERE state = 'NY'

...and get:

CNT | SUM  | AVG
----+------+----------
  3 |  85  |  28.33

The Fun Begins With GROUP BY

The query above is fine, but it would be very laborious if you had to issue the query (or write a program to do it) for every possible state. The answer is the GROUP BY clause. The GROUP BY clause causes aggregations to occur in groups (naturally) for the columns you name.

SELECT state
      ,COUNT(*) as cnt
      ,SUM(sale_amount)          as sumSales
      ,ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state

Which gives us this result:

STATE | CNT | SUM  | AVG
------+-----+------+----
NY    |  3  |  85  |  28
TX    |  2  |  40  |  20
CA    |  2  |  50  |  25  

Every Column a GROUP BY or Aggregate

When you use the GROUP BY column then every column in the output must either be a group by column or must be an aggregate function. To understand this, imagine we put "Date" into the query above:

SELECT state
     , date -- huh?? which value should we get??
     , COUNT(*) as cnt
     , SUM(sale_amount)          as sumSales
     , ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state

Several states have more than one row in the table, so the database server would have to decide which value of DATE to give you. Since it cannot know which one you want, it throws an error that says, in short, "don't confuse me!"

Two More Aggregations, MIN and MAX

If we think again about the DATE column, in most practical situations we usually want to know the smallest or largest value, or both, so this query is not uncommon:

SELECT state
     , MIN(date)                 as minDate
     , MAX(date)                 as maxDate
     , COUNT(*)                  as cnt
     , SUM(sale_amount)          as sumSales
     , ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state

which yields:

STATE | minDate    | maxDate    |CNT | SUM  | AVG
------+------------+------------+----+------+-----
NY    | 2007-11-01 | 2008-05-01 | 3  |  85  |  28
TX    | 2007-12-01 | 2008-02-01 | 2  |  40  |  20
CA    | 2008-01-01 | 2008-03-01 | 2  |  50  |  25  

HAVING Clause is Like WHERE after GROUP BY

The HAVING clause lets us put a filter on the results after the aggregation has taken place. If your Sales Manager wants to know which states have an average sale amount of $25.00 or more, then the query would look like this:

SELECT state
      ,COUNT(*) as cnt
      ,SUM(sale_amount)          as sumSales
      ,ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state
HAVING AVG(sale_amount) >= 25

Which gives us this result. Notice that Texas is now missing, as they were just not selling big enough orders (sorry 'bout that, Rhonda).

STATE | CNT | SUM  | AVG
------+-----+------+----
NY    |  3  |  85  |  28
CA    |  2  |  50  |  25  
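
Running the aggregation and the HAVING filter against the sample table confirms that Texas drops out. This check uses SQLite via Python's sqlite3 module (the ORDER column is renamed "ord" because ORDER is a reserved word):

```python
import sqlite3

# The sample orders table, grouped by state and filtered with HAVING.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (ord INTEGER, date TEXT, state TEXT, sale_amount NUMERIC)")
con.executemany("INSERT INTO orders VALUES (?,?,?,?)", [
    (1234, "2007-11-01", "NY", 10), (1235, "2007-12-01", "TX", 15),
    (1236, "2008-01-01", "CA", 20), (1237, "2008-02-01", "TX", 25),
    (1238, "2008-03-01", "CA", 30), (1239, "2008-04-01", "NY", 35),
    (1240, "2008-05-01", "NY", 40)])

rows = con.execute("""
SELECT state, COUNT(*), SUM(sale_amount), ROUND(AVG(sale_amount),0)
  FROM orders
 GROUP BY state
HAVING AVG(sale_amount) >= 25
 ORDER BY state
""").fetchall()
print(rows)  # [('CA', 2, 50, 25.0), ('NY', 3, 85, 28.0)] -- no TX row
```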

When to use WHERE, When to use HAVING

Then the Sales Manager might come down and say, 'I don't want the states who have no sales after December 2008'. We might automatically code the following, which is tragically wrong:

SELECT state
     , MIN(date)                 as minDate
     , MAX(date)                 as maxDate
     , COUNT(*)                  as cnt
     , SUM(sale_amount)          as sumSales
     , ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 -- WRONG! Will filter out individual rows!
 WHERE date <= '2008-12-31'
 GROUP BY state

The problem here is that the WHERE clause removes individual rows dated after 2008-12-31 before the aggregation happens, so every state still appears, just with statistics computed only from its earlier sales. That is not what was asked. The idea is to completely eliminate states with no sales in 2009 or later, even if they had sales before that time. So we use MAX and the HAVING clause:

SELECT state
     , MIN(date)                 as minDate
     , MAX(date)                 as maxDate
     , COUNT(*)                  as cnt
     , SUM(sale_amount)          as sumSales
     , ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state
HAVING MAX(date) >= '2009-01-01'

Using All Three

You can pull some pretty nice results out of a database in a single query if you know how to combine the WHERE, GROUP BY, and HAVING. If you have ever worked with a Sales Manager, you know they constantly want to know strange numbers, so let's say our Sales Manager says, "Can you tell me the average order size by state for all orders greater than 20? And don't bother with any average less than 30.00." We say, "Sure, don't walk away, I'll print it out right now."

SELECT state
      ,COUNT(*)
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),2) as avg
  FROM orders
 WHERE sale_amount > 20
 GROUP BY state
HAVING avg(sale_amount) >= 30

How to Do a Weighted Average

Consider the case of a table that lists test, homework, and quiz scores for the students in a certain course. Each particular score is worth a certain percentage of a student's grade, and the teacher wants the computer to calculate each student's final score. If the table looks like:

STUDENT     | WEIGHT | SCORE
------------+--------+-------
NIRGALAI    |     40 |    90
NIRGALAI    |     35 |    95
NIRGALAI    |     25 |    85
JBOONE      |     40 |    80
JBOONE      |     35 |    95
JBOONE      |     25 |    70
PCLAYBORNE  |     40 |    70
PCLAYBORNE  |     35 |    80
PCLAYBORNE  |     25 |    90

Then we can accomplish this in one pull like so:

SELECT student
      ,SUM(weight * score) / 100.0 as final
  FROM scores
 GROUP BY student

The nice thing about this query is that it works even if data is missing. If a student missed a test, they automatically get a zero averaged in.
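
The weighted-average pull can be checked the same way (SQLite via Python's sqlite3 module; note the 100.0, since several servers truncate integer division):

```python
import sqlite3

# The weighted-average query run against the sample scores table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (student TEXT, weight INTEGER, score INTEGER)")
con.executemany("INSERT INTO scores VALUES (?,?,?)", [
    ("NIRGALAI", 40, 90), ("NIRGALAI", 35, 95), ("NIRGALAI", 25, 85),
    ("JBOONE", 40, 80), ("JBOONE", 35, 95), ("JBOONE", 25, 70),
    ("PCLAYBORNE", 40, 70), ("PCLAYBORNE", 35, 80), ("PCLAYBORNE", 25, 90)])

finals = dict(con.execute(
    "SELECT student, SUM(weight * score) / 100.0 FROM scores GROUP BY student"))
# NIRGALAI's final is 90.5, JBOONE's 82.75, PCLAYBORNE's 78.5
print(finals)
```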

Conclusion: Queries Are Where It's At

The only reason to put data into a database is to take it out again. The modern database has powerful strategies for ensuring the correctness of data going in (the primary key, foreign key and other constraints) and equally powerful tools for pulling the data back out.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are: