Winter Corporation


WinterCorp

411 Waverley Oaks Road

Waltham, MA 02452

Phone: 781.642.0300

Fax: 781.642.7222

 

Contact us | Privacy

      

VLDB Vision
April 1998

What your mother never told you about database size

How to Avoid the Heartbreak of Teraflation

Richard Winter

Are you getting tired of the empty claims circulating about database size? Databases are getting bigger at an impressive rate, as our 1997 VLDB Survey demonstrates. (The latest results were summarized in the September 1997 issue.) At the 1997 VLDB Summit, we honored United Parcel Service (UPS) for having the largest relational database in operation with more than three terabytes of user data.

That's big all right. But it's not big enough for the industry's latest game of liars' poker. You almost can't go to a vendor presentation lately without hearing about a database of 25 or 50 terabytes. But in these cases, when you ask the right questions, they always boil down to something much smaller--so far, always in the single-digit terabyte range. What your mother never told you is that those databases may be as much as ten times smaller than they sound. Why does this matter? Because if you have five terabytes of data you need to know whether the biggest database your vendor has ever seen contained five terabytes or .5 terabytes!

We can't really blame the vendors, of course. Once one of them starts playing fast and loose with the figures, the others either have to follow suit or look like they are falling behind in the race. They do it because we, the users and their consultants, are impressed. Really, we invite them to do it.

So what do we do about this inflation of database size--this teraflation? The first step is to understand how disk space is actually used in VLDBs; then we can deal with the issue.

TPC-D DATA
Let's begin with one of the niftiest sources of published information on how database products behave: the TPC-D benchmark results. There is an interesting figure published in all TPC-D executive summaries: a ratio officially named "disk size/data size." This number tells you the ratio between the disk actually used in running the benchmark and the amount of user data loaded into the database. I refer to it as the disk-to-data ratio.

Because we are talking about big databases, let's look at the biggest TPC-Ds--those with scale factors of 300 (300 gigabytes of user data) and 1,000 (a terabyte of user data). These are benchmarks run on one terabyte of user data. See Table 1 for the disk-to-data ratios for the 10 that had been posted on the Web site (www.tpc.org) as this article went to press. For these tests, the ratio was between 5.25 and 10.24. That means that roughly five to 10 terabytes of disk were used to run these tests--even though they involved exactly one terabyte of data.

Table 1.

Disk-to-data ratios for the published, terabyte-scale TPC-D Benchmark Results.

 

Database
Product

Hardware
Platform

Publication
Date

Disk-to-Data
Ratio

Oracle 8.0 Sun Starfire 6/23/97 5.44
Teradata V2R2 NCR
Worldmark 5100
7/25/97 5.25
Teradata V2R2 NCR
Worldmark 5150
10/27/97 5.41

You can imagine how easily database size gets inflated in this situation. It's very easy for the vendor to focus on the 10 terabytes of disk rather than the one terabyte of data. Of course, this doesn't happen that much for the TPC-D because there is a defined vocabulary for discussing it and a prescribed set of rules for presenting results. People pretty much have to focus on the volume of user data that is fed to the database, which is a good measure of size. Generally, people refer to the above results as "one terabyte" TPC-D results, which is uncommonly accurate for our industry.

REAL LIFE
In the discussion of real implementations, there is no standard vocabulary and there are no rules for how the description of size must be presented. So it's natural for the vendor to seize on the biggest number in sight: the aggregate capacity of all the disk drives connected to the configuration. That is exactly what they do. To make that number as large as possible, they generally include:

  • Irrelevant disk, which has nothing to do with the database (that is, the disk for the file system, the operating system swap space, and so forth). Actually, this category of disk is entirely relevant to the user's budget, but it is, unfortunately, irrelevant to the amount of user data that is being stored in the database.
  • Redundant disk, which is there only for data protection (such as mirroring, RAID, and product-specific backup facilities such as Teradata fallback files). Redundant disk is used hold copies of the data for immediate use in case the original is damaged, so they represent double counting when it comes to what most users really mean when they inquire about database size.
  • Freespace, which is available for workspace or to provide for future database growth. Freespace often has more to do with the amount of disk that was convenient or economical to acquire at the last purchase point than it does to any ordinary meaning of database size.
  • Summary data, which consists of precomputed aggregates that are stored so that predictable queries can be serviced more rapidly. Depending on the product, these may or may not be visible to the end user.
  • Indexes are structures maintained by the database engine to speed access to the data. Such structures are invisible to the end user or application, are created under DBA control, and are also a form of redundant data storage.
  • User data, which represents the actual information content of the database. Examples would be base facts about customers, products, sales, stores, and so forth. In a relational database, these are the base tables.

WINTER VLDB SURVEY DATA

Now let's take a look at what real-life VLDBs look like in these terms. Figure 1 shows how space is used on average in 22 transaction processing databases portrayed by respondents to the 1997 survey.

Fig.1.

Some noteworthy points are revealed in this chart. First, the disk-to-data ratio in these real-life databases is almost exactly two to one; that is, about half the disk space under the control of the DBMS is actually being used to store data. Second, summaries aren't used much here, and that makes sense: Summaries are primarily a decision-support phenomenon and these are transaction processing databases. Third, indexes are the primary use of space after user data, accounting for 27 percent of the space used. This also makes sense; transaction processing indexes are built mainly for primary keys (such as account number) and for keys that are nearly unique (such as a customer number used as a foreign key in the account). So each transaction processing index tends to be large.

Figure 2 shows the same breakdown for the 41 decision-support databases profiled in the survey. Here, the average disk-to-data ratio is 2.4, but indexes and summaries account for approximately equal amounts of space.

A few observations on the DSS data are in order. First, the disk-to-data ratio in practice varies widely from a low approaching 1 to a high of 9.63. The larger ratios show up in the very large databases, too: One respondent described an enterprise data warehouse with 750GB of user data and 5.3 times that amount of disk. (And I can remember hearing the vendor boast about this as a 4TB data warehouse last spring.)

Second, the majority--about 70 percent--of these decision-support databases exhibit a disk-to-data ratio in the range of 1.4 to 3.

Fig. 2

I look forward to digging deeper into this particular point. My own experience in practice is that large decision-support databases on parallel platforms usually require three to five times as much disk as they have data to meet the demanding performance and availability requirements typical of today's applications. But the data tells the story: The majority of users responding to our survey report that they are making do with less.

DIRECTIONS IN THE USE OF SPACE
I believe that the disk-to-data ratio is going to continue to rise in VLDBs, particularly in decision support. Three trends contribute to this growth.

First, large-scale DSS indexing is improving. In the last year, Informix, Oracle, and Teradata have each introduced some form of join indexing. Informix and Oracle have introduced bitmap indexing. Red Brick data warehouse and Sybase IQ, both of which make heavy use of advanced indexing techniques, have experienced increased use in the large-scale data warehouse. Because historical data can be indexed once and then retrieved many times, indexing is fundamentally a good option in large-scale decision support. And the more indexing is done in practice, the more disk space will be used.

In a conversation with Carrie Ballinger and Pekka Kostamaa of NCR, I confirmed that recent experience with Teradata on TPC-Ds exhibits this growing use of join indexes, which increases the disk space but pays off in better performance and price/performance.

Second, disk prices continue to drop. Drive prices dropped 10 percent in the last quarter of 1997 alone according to Michael Brown, chief executive of Quantum. This price drop continues the trend of the last few years of 40 to 50 percent annual decline in prices.

Third, the disparity between disk read/write capacity and disk storage capacity continues to increase. That is, the 2GB drives that were standard until recently have given way to 4GB, 9GB, and 23GB drives--all without increases in the number of read/write heads per drive. Thus, in high-performance applications, there is an increasing trend to leave part of the disk empty.

With all these trends operating, I believe that we will see larger disk-to-data ratios--not only in benchmarks but in practice on VLDBs.

DATA COMPRESSION
One technological development goes against the more widely felt trends above: data compression. A few database management products compress data. Two notable examples are Sybase IQ and Red Brick Data Warehouse. Users are most commonly interested in compression today because of the impact of compression on disk I/O. If you can compress 100GB table in 20GB of storage, then you can scan the table five times faster. Of course, you may give back at least part of that gain as you decompress the data. But you can come out ahead if clever algorithms are used and the hardware is configured accordingly. Used properly, compression can buy response time and consequently make data more accessible to users. Data compression also affects the ratios discussed in this article because it makes both "user data" and "total disk space" smaller.

TERAFLATION
So teraflation is here to stay. Users need to be wary on two counts.

First, vendors will continue to exaggerate claims about database size to make it sound as though they have more of a VLDB track record than they really do, which is dangerous because it gives the user a false sense of security. If you are the first user of product X to build a decision-support database with two terabytes of user data, you need to know that and plan accordingly to manage the very real risks inherent in your project. So when you hear a vendor make a claim about a database's size, get the size broken down into the components defined above. Don't think of one site as a "track record," and don't think a single measure (such as bytes of user data stored) tells the whole story about performance or scalability. If you are doing an evaluation, try to get a more complete picture of practice with multiple sites and multiple dimensions of database performance and scale.

Second, when you do your capacity planning, recognize that disk-to-data ratios range widely and may be changing rapidly. Also, they are affected by multiple factors, including drive characteristics, physical database design, indexing strategy, data availability, and performance requirements. Be aware that your actual disk requirement can be two, three, six, or even more times your requirement for stored user data. If your application is mission-critical, be sure to select a data availability strategy that meets your requirements. If that requires additional disk, so be it. Recognize that having more disk drives provides more read/write heads and can thus increase performance. And recognize that data storage requirements tend to increase rapidly.

And if you are still wondering how big those databases really are this year, stay tuned. The biggest production databases of 1998 were highlighted at the VLDB Summit Conference in Beverly Hills, Calif. in March, and you will be able to read more about them right here soon.

Top of page