WinterCorp Logo


WinterCorp

245 First Street

Suite 1800

Cambridge, MA 02142

Phone: 617-695-1800

Fax: 617-848-3795

 

Contact us | Privacy

      

Winter Corporation VLDB News

Intelligent Enterprise
December 1998

Databases: Back in the OLAP Game

Richard Winter

I remember when I first heard about multidimensional query in the 1970s. An acquaintance working on an MIT research project told me about the system he was building. He explained that you can think of the data as being organized in a cube and picture the queries as operations on rectangular slices cut in any direction in that cube. He said these 3D queries were hard to do in a relational database. Moreover, if you went to four or more dimensions, the cube would become an n-dimensional space and the database problem would worsen.

He was building a system that could run multidimensional queries simply and efficiently. It had the right language constructs and data structures. His engine would outperform all existing relational databases on this type of query.

Shouldn’t the Database Engine Do That?

I had never before heard about this spatial conceptualization of the problem, although I had been working for years with customers doing all kinds of aggregation. I liked the cube user view, but I wondered if the problem couldn’t be solved well enough within the database engine. I figured I could make my engine (at that time, a Model 204) do this stuff efficiently.

This guy, I thought, doesn’t really appreciate what database engines can do. Besides, I could tell by the way that he described the data structures that he wasn’t considering how to address large volumes of data

As it turns out, the acquaintance was Jeff Stamen, who later became a well-known and successful executive with MDS, IRI Software, and finally Oracle and has a long association with what is now called the OLAP market. He was creating across the hall from me what, I assume, would later become Express and, now, Oracle Express

In the intervening years, hundreds of millions of dollars have been spent on OLAP products, which have become extraordinarily important in data warehousing and decision support. People now use OLAP products routinely on data volumes larger than Jeff and I could have imagined in the ’70s

But OLAP engines as we know them, although popular, are not the answer for VLDBs. With enough data, dimensions, or aggregates, their performance declines. And although industry experts expect OLAP engines to improve in performance and scalability, VLDBs are likely to outpace their gains

I worked with Model 204 for several years after that conversation, but I never did get around to implementing that cube idea in the database, which by that time had become capable of such an approach. And as recently as two years ago, no major relational database product could aggregate data as well as OLAP products, even on modestly sized data sets

So this idea of a separate OLAP engine has worked out pretty well for the last 20 years. Evidently, Jeff Stamen was on to something.

Now, however, many people would agree that a database engine should service requests for aggregates—even aggregates across very large volumes of data — efficiently. Using the database engine more doesn’t mean the end of OLAP engines, but it may mean that you can manage many OLAP applications better by keeping data and some or all aggregates in the relational database.

Why do the aggregation within the database engine? First, aggregation is often a part of some larger sequence of operations the machine must execute on detailed data, which is available only in the database itself. Second, queries often combine aggregation with other SQL operations or stored procedures the OLAP engine might not functionally support. Third, integrating the aggregation with other database operations can provide additional opportunities for optimization, thereby boosting performance and efficiency. Fourth, as a practical matter, the leading database engines have more powerful capabilities for parallel operation and optimization than OLAP products. If OLAP products can get the database engines to do some of their work for them, it will reduce the need to replicate storage of aggregates.

The Database Engine Does

This year Red Brick Systems Inc. (recently acquired by Informix Software), IBM, and Oracle — in that order — introduced substantial aggregation capabilities — most notably, an aggregation engine, aggregate cache, and aggregate-aware optimizer

The engine computes aggregates efficiently in advance; the aggregate cache stores the aggregates for later use. Because users call frequently for certain aggregates (for example, sales by store, day, week, or month) and because aggregates are often expensive to compute, this method can save a lot of time and resources. The aggregate-aware optimizer recognizes when retrieving a cached aggregate is more efficient than computing a new one, and it develops the query plan accordingly

The following example illustrates performance improvement with these aggregation capabilities. Suppose you want to total sales by store for each of the last four quarters for each of your 500 stores. If you write a SQL request to calculate this information, it will yield 43500= 2,000 totals.

If you perform this request with ordinary query processing techniques, it will require the engine to read every row of the sales table for the last year. In a large retail operation, such a procedure could require touching billions of rows.

As an alternative to this massive undertaking, if the database engine has precomputed and stored the quarterly totals, each request necessitates a relatively quick retrieval of only 2,000 aggregates from the cache. If the sales table contains two billion rows, aggregate caching reduces the elapsed time and resource requirement by a factor of a million or more. The larger the collection of detailed data, the greater the benefit from using cached aggregates. “Hot” cached aggregates may well remain in the buffer pool — which, for many queries, will eliminate the need for hundreds of millions of disk I/O reads.

A remarkable aspect of this feature is that it doesn’t require changes or special constructs at the SQL level to exploit it in a query. Even queries written before the feature was installed will exploit the feature. If they contain an aggregate, the optimizer will recognize it and use precompiled aggregates to reduce processing effort

IBM chose to deliver its new aggregation capabilities in DB2 UDB 5.2, released in September 1998, in the form of Automated Summary Tables (ASTs). You define these tables with DDL statements similar to those used to define views. The DBMS materializes these tables; then the optimizer automatically applies aggregates stored in them as it evaluates queries. You can index the ASTs or replicate them across nodes of an MPP system in order to speed access. The end user is not aware of the ASTs

The DBMS is responsible for keeping the ASTs updated. A DBA option determines the method for refreshing them.

Along with its aggregate facilities, IBM also introduced join indexes in DB2 UDB 5.2. The two work well together to produce a dramatic increase in efficiency, because the optimizer gives priority to aggregate indexes on a single table or a join of tables.

Red Brick uses a different tactic in an optional Red Brick Data Warehouse component named Vista, which it released in the first quarter of 1998. Based on observed usage patterns, a component of Vista called the AggregateAdvisor will recommend a set of aggregates for optimal performance. These may be the aggregates that actually appear, or they may be an underlying set of aggregates that Vista recommends

Based on DBA review of the recommendations, Vista computes and stores the aggregates. It then automatically and transparently rewrites queries to apply the stored aggregates

Red Brick has had its star index, a form of join index, for several years. The optimizer automatically uses the star index to compute aggregates and execute queries that apply the cached results

In November, Oracle announced version 8i, due for delivery in late 1998. Although marketing positioned it as “Oracle for the Internet,” this version contains noteworthy advances in support of large-scale data warehousing and is capable of aggregate storage and management. As with Red Brick and DB2 UDB, the Oracle engine will transparently rewrite SQL queries to exploit precomputed aggregates.

What’s interesting is that Oracle 8i supports materialized views. When the DBA requests a materialized view, the engine creates and stores a copy of the data defined by the view. Any such data is necessarily a replica or derivation of other data in the database. The engine then takes responsibility for maintaining the materialized view and keeping it consistent with the data from which it is derived. The precalculated aggregate is simply a specific case of the materialized view.

This powerful capability is likely to have many other applications. For example, it lets you materialize denormalized forms of the data for fast performance on certain data warehouse queries; meanwhile, the base tables and, if you wish, the user view of the data, can remain normalized.

A notable feature of Red Brick and Oracle 8i is that they store the metadata’s dimensional hierarchies. So the information that zip codes roll up to states, which roll up to districts, which roll up to regions, becomes part of the metadata. The Query Rewriter will then exploit this information, along with its knowledge of which aggregates are precompiled, to find the best way to evaluate a query. For example, if presented with a query that refers to “sum of sales by month,” an aggregate not yet computed, it will automatically rewrite the query to use the stored aggregate “sum of sales by week.”

Scalability

Because I haven’t tested any of the implementations’ scalability, I can’t comment on their ability to manage large numbers of stored aggregates, perform large refreshes, address many concurrent queries, and execute other large operations. So questions remain about how broadly applicable they are to VLDBs.

Nonetheless, given the basic approach of these aggregation products, they will likely be significantly applicable to VLDBs. Unless the implementation is terribly flawed, performance is bound to improve when you replace a billion row accesses with one, or even a few thousand, retrievals of stored aggregates

This method doesn’t just double multidimensional query efficiency; it promises performance gains of 100 to 1,000 times in most aggregate operations. And as illustrated in the preceding example, improvements on the magnitude of one million times are plausible

I look forward to implementing and measuring the performance of the new type of aggregation facility Red Brick, IBM, and Oracle introduced. Only then will I be able to reach more definitive conclusions.

With a reasonably successful implementation, these facilities will greatly influence the use of VLDBs. Most VLDB applications do not deliver detailed data to the user. Human beings have no way of directly understanding a billion call records or a billion point-of-sale transactions. Rather, most decision-support applications are about analyzing and presenting aggregate data. And in the handling of aggregate data, these vendors have taken a major step forward this year.

RESOURCES
IBM: www.ibm.com
Oracle: www.oracle.com
Red Brick Systems Inc.: www.redbrick.com