|
|
Intelligent
Enterprise Databases:
Back in the OLAP Game I remember when I first heard about multidimensional query in the 1970s. An acquaintance working on an MIT research project told me about the system he was building. He explained that you can think of the data as being organized in a cube and picture the queries as operations on rectangular slices cut in any direction in that cube. He said these 3D queries were hard to do in a relational database. Moreover, if you went to four or more dimensions, the cube would become an n-dimensional space and the database problem would worsen. He was building a system that could run multidimensional queries simply and efficiently. It had the right language constructs and data structures. His engine would outperform all existing relational databases on this type of query. Shouldnt the Database Engine Do That?I had never before heard about this spatial conceptualization of the problem, although I had been working for years with customers doing all kinds of aggregation. I liked the cube user view, but I wondered if the problem couldnt be solved well enough within the database engine. I figured I could make my engine (at that time, a Model 204) do this stuff efficiently. This guy, I thought, doesnt really appreciate what database engines can do. Besides, I could tell by the way that he described the data structures that he wasnt considering how to address large volumes of data As it turns out, the acquaintance was Jeff Stamen, who later became a well-known and successful executive with MDS, IRI Software, and finally Oracle and has a long association with what is now called the OLAP market. He was creating across the hall from me what, I assume, would later become Express and, now, Oracle Express In the intervening years, hundreds of millions of dollars have been spent on OLAP products, which have become extraordinarily important in data warehousing and decision support. People now use OLAP products routinely on data volumes larger than Jeff and I could have imagined in the 70s But OLAP engines as we know them, although popular, are not the answer for VLDBs. With enough data, dimensions, or aggregates, their performance declines. And although industry experts expect OLAP engines to improve in performance and scalability, VLDBs are likely to outpace their gains I worked with Model 204 for several years after that conversation, but I never did get around to implementing that cube idea in the database, which by that time had become capable of such an approach. And as recently as two years ago, no major relational database product could aggregate data as well as OLAP products, even on modestly sized data sets So this idea of a separate OLAP engine has worked out pretty well for the last 20 years. Evidently, Jeff Stamen was on to something. Now, however, many people would agree that a database engine should service requests for aggregateseven aggregates across very large volumes of data efficiently. Using the database engine more doesnt mean the end of OLAP engines, but it may mean that you can manage many OLAP applications better by keeping data and some or all aggregates in the relational database. Why do the aggregation within the database engine? First, aggregation is often a part of some larger sequence of operations the machine must execute on detailed data, which is available only in the database itself. Second, queries often combine aggregation with other SQL operations or stored procedures the OLAP engine might not functionally support. Third, integrating the aggregation with other database operations can provide additional opportunities for optimization, thereby boosting performance and efficiency. Fourth, as a practical matter, the leading database engines have more powerful capabilities for parallel operation and optimization than OLAP products. If OLAP products can get the database engines to do some of their work for them, it will reduce the need to replicate storage of aggregates. The Database Engine DoesThis year Red Brick Systems Inc. (recently acquired by Informix Software), IBM, and Oracle in that order introduced substantial aggregation capabilities most notably, an aggregation engine, aggregate cache, and aggregate-aware optimizer The engine computes aggregates efficiently in advance; the aggregate cache stores the aggregates for later use. Because users call frequently for certain aggregates (for example, sales by store, day, week, or month) and because aggregates are often expensive to compute, this method can save a lot of time and resources. The aggregate-aware optimizer recognizes when retrieving a cached aggregate is more efficient than computing a new one, and it develops the query plan accordingly The following example illustrates performance improvement with these aggregation capabilities. Suppose you want to total sales by store for each of the last four quarters for each of your 500 stores. If you write a SQL request to calculate this information, it will yield 43500= 2,000 totals. If you perform this request with ordinary query processing techniques, it will require the engine to read every row of the sales table for the last year. In a large retail operation, such a procedure could require touching billions of rows. As an alternative to this massive undertaking, if the database engine has precomputed and stored the quarterly totals, each request necessitates a relatively quick retrieval of only 2,000 aggregates from the cache. If the sales table contains two billion rows, aggregate caching reduces the elapsed time and resource requirement by a factor of a million or more. The larger the collection of detailed data, the greater the benefit from using cached aggregates. Hot cached aggregates may well remain in the buffer pool which, for many queries, will eliminate the need for hundreds of millions of disk I/O reads. A remarkable aspect of this feature is that it doesnt require changes or special constructs at the SQL level to exploit it in a query. Even queries written before the feature was installed will exploit the feature. If they contain an aggregate, the optimizer will recognize it and use precompiled aggregates to reduce processing effort IBM chose to deliver its new aggregation capabilities in DB2 UDB 5.2, released in September 1998, in the form of Automated Summary Tables (ASTs). You define these tables with DDL statements similar to those used to define views. The DBMS materializes these tables; then the optimizer automatically applies aggregates stored in them as it evaluates queries. You can index the ASTs or replicate them across nodes of an MPP system in order to speed access. The end user is not aware of the ASTs The DBMS is responsible for keeping the ASTs updated. A DBA option determines the method for refreshing them. Along with its aggregate facilities, IBM also introduced join indexes in DB2 UDB 5.2. The two work well together to produce a dramatic increase in efficiency, because the optimizer gives priority to aggregate indexes on a single table or a join of tables. Red Brick uses a different tactic in an optional Red Brick Data Warehouse component named Vista, which it released in the first quarter of 1998. Based on observed usage patterns, a component of Vista called the AggregateAdvisor will recommend a set of aggregates for optimal performance. These may be the aggregates that actually appear, or they may be an underlying set of aggregates that Vista recommends Based on DBA review of the recommendations, Vista computes and stores the aggregates. It then automatically and transparently rewrites queries to apply the stored aggregates Red Brick has had its star index, a form of join index, for several years. The optimizer automatically uses the star index to compute aggregates and execute queries that apply the cached results In November, Oracle announced version 8i, due for delivery in late 1998. Although marketing positioned it as Oracle for the Internet, this version contains noteworthy advances in support of large-scale data warehousing and is capable of aggregate storage and management. As with Red Brick and DB2 UDB, the Oracle engine will transparently rewrite SQL queries to exploit precomputed aggregates. Whats interesting is that Oracle 8i supports materialized views. When the DBA requests a materialized view, the engine creates and stores a copy of the data defined by the view. Any such data is necessarily a replica or derivation of other data in the database. The engine then takes responsibility for maintaining the materialized view and keeping it consistent with the data from which it is derived. The precalculated aggregate is simply a specific case of the materialized view. This powerful capability is likely to have many other applications. For example, it lets you materialize denormalized forms of the data for fast performance on certain data warehouse queries; meanwhile, the base tables and, if you wish, the user view of the data, can remain normalized. A notable feature of Red Brick and Oracle 8i is that they store the metadatas dimensional hierarchies. So the information that zip codes roll up to states, which roll up to districts, which roll up to regions, becomes part of the metadata. The Query Rewriter will then exploit this information, along with its knowledge of which aggregates are precompiled, to find the best way to evaluate a query. For example, if presented with a query that refers to sum of sales by month, an aggregate not yet computed, it will automatically rewrite the query to use the stored aggregate sum of sales by week. ScalabilityBecause I havent tested any of the implementations scalability, I cant comment on their ability to manage large numbers of stored aggregates, perform large refreshes, address many concurrent queries, and execute other large operations. So questions remain about how broadly applicable they are to VLDBs. Nonetheless, given the basic approach of these aggregation products, they will likely be significantly applicable to VLDBs. Unless the implementation is terribly flawed, performance is bound to improve when you replace a billion row accesses with one, or even a few thousand, retrievals of stored aggregates This method doesnt just double multidimensional query efficiency; it promises performance gains of 100 to 1,000 times in most aggregate operations. And as illustrated in the preceding example, improvements on the magnitude of one million times are plausible I look forward to implementing and measuring the performance of the new type of aggregation facility Red Brick, IBM, and Oracle introduced. Only then will I be able to reach more definitive conclusions. With a reasonably successful implementation, these facilities will greatly influence the use of VLDBs. Most VLDB applications do not deliver detailed data to the user. Human beings have no way of directly understanding a billion call records or a billion point-of-sale transactions. Rather, most decision-support applications are about analyzing and presenting aggregate data. And in the handling of aggregate data, these vendors have taken a major step forward this year. RESOURCES |