WinterCorp Logo


WinterCorp

245 First Street

Suite 1800

Cambridge, MA 02142

Phone: 617-695-1800

Fax: 617-848-3795

 

Contact us | Privacy

      

Intelligent Enterprise
September 1999

Be Aggregate Aware

Richard Winter


Recent releases of several major database engines have included optimizers that are: Aggregate-Aware. That is, the optimizer can be made aware of pre-computed aggregates stored in the database and exploit them to reduce query time and cost. It. s important to be aware that, although exploiting this feature might sound unequivocally like a good idea, you must apply these engines. capabilities to suit your particular situation . lest you increase your troubles rather than relieve them.

In DB2 Universal Database for Unix and NT, Versions 5.2 and 6.1, these optimizers are called automated summary tables (ASTs). In Informix. s Red Brick Data Warehouse 5.1, a facility named Vista supports the definition and use of stored aggregates.

In Oracle8i , summary tables are one case of materialized views, in which the view happens to contain aggregates. NCR is addressing this area in Teradata as well, although it has not yet delivered capabilities comparable to the other products discussed here.

In all cases there are three basic ideas:

  • The DBA controls summary table creation. Each table is, in effect, defined by means of a query; Oracle8i and Red Brick also allow the DBA to register preexisting summary tables, which the DBA or application may create.
  • Users need not be aware of the summary tables. The user states queries in terms of the underlying data or views; when the optimizer determines that referencing a summary table helps complete a query more efficiently than referencing base tables does, it rewrites the query to do so.
  • In DB2 and Oracle8i , the system automatically maintains aggregates. That is, when the underlying data changes, the database engine automatically updates the summary.

The result is like magic . queries using such functions as sum, count, and average simply run faster than before . often 10 or 100 times faster. In addition, users are freed from dependence on application-defined aggregates, the benefits are automatically delivered in many cases to end-user tools and analytical applications, and users are shielded from the complexity of understanding both summary tables and base tables . they need only know about base tables and/or the views relevant to their interests. Overall, few. if any. new database engines. capabilities in the last three years have the values of these summary table features for the large-scale data warehouse user.

This development is a big win for the DBA as well, who can now often be freed from the burden of maintaining the application code underlying the summary tables. creation and updating. Also, the DBA has a much reduced burden with respect to user education about summary tables. But it. s not a free lunch for the DBA, who now confronts the design challenge of choosing the summary tables to maintain.

Combinatorial Explosion

Summary tables. key design challenge, which is far worse than linear with scale, is that the possible number of aggregates rapidly becomes astronomical as the size and complexity of a database grows. The set of stored aggregates can readily become much larger than the base data. Thus it becomes easy to exceed any reasonable budget for space or update time.

For example, suppose you have a retail database with data on 36 months of activity at 2,000 stores and the following further characteristics. There are 100,000 products, 10,000 suppliers, ten million customers, and 10 billion transactions. Stores are grouped into 250 districts, 50 states, and 10 regions. Products fall into 100 categories (such as . audio. ) and 10 departments (such as . electronics. ).

Suppose that the most common customer groupings for purposes of analysis are by 10 age groups, 100 occupational groups, and 10 other demographic characteristics that average 10 values each. Suppose suppliers are grouped by 10 characteristics, each of which has 10 values. Finally, suppose that the divisions of time of interest are day, week, month, quarter, and year.

To give one example of the combinatorial effect, a table summarizing sales by product, store, day, and customer occupation has a potential row count of:

100,000 products * 2,000 stores * ~1,000 days * 100 customer occupationsfor a nominal total of 20 trillion combinations. This is actually 2000 times larger than the transaction table.

Fortunately, you won. t actually get 20 trillion rows in your summary table. As long as the summary is formed by a SQL group by on the fact table, it will not have more rows than the fact table. However, it will have a lot of rows in this example . plausibly billions . and this is just one summary.

The next aggregate that comes to mind, perhaps looking at sales by supplier aggregated several different ways, will plausibly contribute another billion rows.

It is easy to find yourself in a situation in which the first few aggregate sets you would want to keep around occupy more storage than the database itself. But, the time and resources required to maintain such a large aggregate cache in a data warehouse with large daily or weekly updates can also present a demanding problem.

The Design Challenge

In fact, as you begin to play around with the summaries of interest in a real application, you quickly run up against a tough tradeoff: If you try to cram many summaries into just a few tables, the row length becomes a problem. If you try to create many tables, unless you avoid the lower levels of all the hierarchical dimensions, the row count becomes a problem.

Either way, too much summary data will sink you, because you can. t afford to store or maintain more than a certain amount. But too little summary data will sink you, too, as users find that the simplest and most obvious questions (What was our volume of electronics sales in my store last week? How did that compare to sales in our average store for the same period? How about for the same week last year?) take a very long time to answer.

Principal Design Tradeoffs

Summary tables are much like indices. They are redundant information, derived from the base data. There is a cost associated with storing and maintaining each element of the summary. And, there is a benefit, in terms of saved time and effort in query execution, associated with each element of the summary.

The business value of a set of summary tables has two principal components.

First, summary tables may make it possible to satisfy response time requirements that would otherwise be infeasible. For example, if it is necessary to provide interactive response to queries concerning sales by district by department, fulfilling that requirement may be impossible without stored summaries.

Second, you can use summary tables to reduce the total cost of meeting a given set of operating requirements. For example, a common query involving aggregates might require 500 disk reads without a summary table but only five disk reads with it. The 100-to-1 reduction in disk activity will translate into a cost savings in handling that query. Accumulating the savings across all of the queries performed on a data warehouse in a month, employing a well-designed set of summary tables can result in a large reduction in hardware requirements for a given set of queries. Of course, the cost equation must also take into account the cost of maintaining the summary tables.

In fact, I have been able to see the tradeoffs clearly myself only in terms of a set of equations.

Design Tradeoff Equations

For a given summary table, Si, on the cost side: we have SCi, storage cost; MCi, maintenance cost; and MTi, maintenance time.

For a given query, Qj, we can say that QjTb, is the baseline query time, or the query time without summary tables. Similarly, let QjCb be the baseline query cost. Then, we define QjTi as the query time with summary table Si present and QjCi as the query cost with summary table Si present.

Then, if n is the number of summary tables and k is the number of queries, we have:

  1. Net system cost impact = Sum [SCi + MCi + Sum(QjCi – QjCb)]
  2. Query response time impact = Sum (QjTi – QjTb)where j ranges from 1 to n and i ranges from 1 to k in both equations. And, the key constraint tends to be:
  3. Sum (MTi) < (periodic batch window – (backup time + base data "update" time))

Under most circumstances, the goal is to make sure that query response time is improved, total system cost is not greatly increased and that the maintenance of the summary tables can be accomplished within the existing batch window.

However, it is sometimes difficult to get the information to plug into these equations. In practice, human effort is usually focused on a few key summary tables and a few key query requirements. When you can identify the classes of queries that are most common and/or most critical - or likely to be most common or most critical - in a given application, I have found that analyzing the design with a spreadsheet is practical and quite rewarding. You can use the formulas to structure your analysis. You then make approximations of some of the quantities and do some simple measurements and you can quickly position yourself to readily identify the major design advantages and disadvantages of various summary table approaches.

At the same time, you cannot afford to gather data and analyze tradeoffs in depth for every possible summary table in a large data warehouse. In my view, that's the reason the summary advisors were created in both Oracle and Informix Red Brick Data Warehouses.

Limitations of the Summary Advisors

The Summary Advisor is a useful concept and I believe it will prove to be a valuable tool to many users. However, it does have limitations and these may be critical in some larger or more demanding data warehouse settings. Some key issues are these:
  • Summary Advisors, at least at present, ignore the cost of maintaining the summary table. The cost/benefit equation is approximated with an estimate of the summary table's side. It is easy to imagine small summary tables that have a high cost of construction and maintenance because they require touching many rows of the base table; this is one example of a case that will result in the Advisor making suboptimal recommendations.
  • The row count of a summary table depends greatly on the cardinality, skew, and independence of the columns in the group by used to define it. Thus, if there are 1,000 days, 2,000 stores, and 100,000 products, there can in principle be 200 billion combinations of these values. There will be many fewer and, in a star schema, the total number of combinations will be less than the row count of the fact table. But, the number of unique combinations in the database might plausibly range anywhere between a few million and 1.5 billion. It's difficult to estimate this effect well in large databases, and how well today's summary advisors do it is unproven. If the estimates are off significantly, they will affect both cost and benefit.
  • Summary Advisors are used most readily to analyze query history, treating all queries as equally important. But in practice, some queries are more important than others. Furthermore, it's often important to deliver good performance to users with new or different requirements; today's summary advisors don't help much with that (although Oracle does have an "analysis" mode for its Summary Advisor that lets you specify an expected workload; by weighting the expected workload one could get the Advi- sor to emphasize certain queries more).

All of these issues can be addressed in time by the database vendors and probably will be as this type of facility evolves. Meanwhile, though, DBAs need to be aware of the tradeoffs in creating and maintaining summary tables, use the Summary Advisor appropriately, and apply some skill and intuition to the design problem that remains.

It makes sense to rely heavily on a Summary Advisor for applications where performance demands are modest and DBA resources are not available to perform the necessary analysis. For larger tables and more performance-sensitive applications, however, a design review - at the very least - is in line.

Regardless of how you address the design problem, there is no question that aggregate-aware optimizers and system-managed summary tables are an enormous boon. Effectively used, they yield large gains in performance, cost, and ease of use. They also free the DBA from manual summary-table maintenance chores in many cases. If you have a large data warehouse, how you design and implement summary tables, and how well you exploit the capabilities of your database engine for dealing with them, will be an important factor in the scalability of your solution.

Richard Winter is a specialist in large data base technology and implementation, and is president of Boston-based Winter Corp. (You can reach him via email atRichard.Wintercorp.com or by fax at 617-338-4499.)