|
|
VLDB Vision Defining Your Data
Warehouse Workload The key input to most large-scale engineering decisionsMy August 1998 column ("Is Your Data Warehouse Headed for VLDB Trouble?") was devoted to a simple test designed to help you figure out if you have the information you need to make a technical success of your large scale data warehouse. The test is built around the idea of the critical VLDB engineering inputs. If you don't have them defined, generally you are headed for trouble. The critical inputs are: • Response time requirements • Throughput requirements • Data freshness requirements • Data availability requirements (under what circumstances, if any, can the data be unavailable for use) • Operating schedule (for example, what are the batch windows, if any) • A database concept • An estimated workload. You need to know these things--or have working estimates for them--and you need projections of how they are likely to change over the lifetime of your scaling plan. When working with clients, I find that they often struggle with the concept of an estimated workload for a data warehouse. This is usually at the heart of the matter when data warehouse performance goes astray, so this column is devoted to the subject of defining a data warehouse workload. Definition of a Data Warehouse WorkloadA workload must be defined in terms of particular unit of time. This can be a day, a week, a month, a quarter, or even a year. A week is usually best. When there are longer cycles as well (monthly or quarterly), it usually works to define additional weekly workloads for the end of each cycle. For engineering purposes, think of a workload as consisting of several workclasses, such as interactive query, data transformation, batch load and update, data extraction, data mart download, and so on. As an example, let's use IQ as the name of the workclass representing interactive query. Each workclass is defined in terms of a set of transaction classes, a schedule, a set of transaction rates, and if relevant, a characterization of how transaction rates vary over the schedule. A schedule defines the time period over which the workclass is processed. Thus, a typical schedule for interactive query processing might be 6 a.m. to 10 p.m., Monday through Saturday. This is a total of 96 hours per week. A transaction class is an intuitive characterization of some significant element of the workload. For example, one class of interactive queries might consist of those queries that aggregate data on customers by their demographics. As an example, lets call this class IQ1. The average transaction rate defines how many transactions of a given class are expected to occur per unit of time over the scheduled period for the workclass. In data warehouses, usually it's best to work in queries per hour or queries per minute. Thus, if we had five IQ1 transactions per hour averaged over the 96 hours per week of interactive query operation, we would define five per hour as the average transaction rate. Note that, in this case, this translates to 5 3 96, or 480 transactions per week for this transaction class. The peak transaction rate defines how many transactions of a given class we expect in the busiest hour of the week for that class. A top-level characterization of a data warehouse workload is illustrated in Tables 1 and 2. The values are for purposes of illustration only. The main purpose of Table 1 is to provide a sense of the principal elements of a workload and the level of detail needed to establish a top-level picture of how each fits into the week.
Table 2 gives an example of an early definition of the transaction classes and associated rates for the interactive query workclass.
It is impossible to define all the transaction classes that will actually occur. In a data warehouse, nobody knows what the users will ask until they have access to the data with the right tools and applications. However, it is usually possible to make a reasonable projection--or a small set of reasonable alternative projections--to characterize the likely workload for a given phase of operation. In our example Table 2, we show a planner defining and estimating usage volume for 20 classes of interactive queries. The idea here is to define a set of classes which together account for most of the expected workload. In practice, usually I find that somewhere between 10 and 25 query classes will account for most of the expected workload When you have a version of Table 2 for each of the workclasses, you have actually completed a first pass definition of your workload. You will then have made your best guess at the principal types of work, the principal transaction classes for each type, and the average and peak rates at which they must be processed by the system. If you have gotten this far, no matter how rough your class definitions and your estimates, you will be farther along than most data warehouse planners ever get. How to Get the Initial Class Definitions and EstimatesThe rough class definitions and estimates I illustrate in Tables 1 and 2 come from one or more of the following sources: • Knowledge of what is typically done with data warehouses in your industry • User requirements development, via user interviews and studies • Data warehouse application planning. You can use user interviews and studies to understand the business interests of prospective users; you can then work with the users to identify transaction classes that solve known business problems significant to them. Once you have done this, it is generally possible to work with user organizations to estimate the number of users likely to have an interest in using a given transaction class and the number of times per day or week each is likely to run it. As I described in my February 1997 column, "An Application-Driven Approach," I believe that it is most fruitful to view the data warehouse as a vehicle for delivering business application solutions to the organization. Thus, it may be valuable for its capability to support ad hoc query and analysis, but the majority of the workload and the business value will come from the use of specific applications, whether developed in house or acquired. Thus, a common application today is campaign management, the process whereby targeted direct marketing efforts are created, conducted, evaluated, and improved. The transaction classes and the volumes for defined applications are often readily available or easily estimated. Finally, knowledge of the data warehouse workload for other organizations in your industry--either via user group contacts or consultants--can be a useful basis for early estimates. Don't fail to take into account what may be unique about your organization. But another company's prior experience may be a much better starting point than a blank sheet of paper. How the Workload Definition is RefinedThere are three main dimensions in which the workload definition, once created, gets refined: • Boundaries of transaction classes become clarified over time. • Transaction descriptions gradually become more concrete, and ideally evolve into characteristic SQL. • Estimated transaction rates become more solid. By characteristic SQL, I mean examples of SQL that define the principal performance challenge represented by the transaction class. Thus, there may be 1000 different queries you can write that would fall in class IQ1, entitled "aggregate customer by demographics." But having some idea how your marketers use data, you might decide that the query "What is the average disposable income of our married female customers in Chicago?" is intuitively typical. In my experience, this type of decision is usually pretty good, and you can always improved on it later if necessary. If class IQ1 represents 5 percent of your query volume and you make a reasonably good guess, it will probably be good enough. You might examine this question and decide that there are two quite different queries, both of which seem typical. If the queries appear to have similar performance requirements (for example, they would both always force a scan of the customer table), you might choose one and stay with it as a good enough estimator of the workload requirement. But if the queries really seemed quite different in their performance demands, if they both seemed typical and equally representative of the class, and if the class seemed fairly important, then you might split the class into two. You can't do this for every class, but you can do it for a few. It makes no real difference whether you end up with 15, 20, or 25 transaction classes for any given piece of the workload. A single transaction class might adequately represent some workclasses. It is most important to realize that it is much easier to improve a definition than to create one in the first place. I have been involved in, or led, workload definition for many projects by now, and I can promise you it nearly always feels the same way. It feels as if there is not nearly enough to go on. The initial estimates feel like little more than wild guesses. But recognize this: If you get together most or all the people who are likely to have an idea and make a guess, you are way ahead of the people who are afraid to try. What happens is that the very act of expressing the estimate--and attempting to refine it at various points in the project--forces to mind the questions that can clarify the answers. If you work with the users, the applications, the data, the business examples that you can tease out of requirements sessions, you gradually develop a reasonable feel for what the workload is likely to look like. Eventually, you end up with a set of tables that are as good an expression of the workload as anyone is likely to develop. How to Use the Workload DefinitionThe workload is your best model at any given point in the project of what the data warehouse has to be able to process in a given day or week. Its main uses are in performance and capacity planning, modeling, measurement, benchmarks, pilot projects and prototypes, and tuning. When the workload definition is precise enough, you can give it outside parties (such as vendors) along with certain other inputs, and they can actually run benchmarks for proposed platforms or run quantitative models. If you have your platform already, you can prototype your workload to explore database and application design questions. You can implement a pilot project with the real data and your representative transactions to get an idea of how the system performs. You can then gradually replace your synthetic transactions with real ones as the system develops. If you have a system in production, you can build concrete characterizations of the workload, gather data from the real system on actual transaction rates and resource requirements, and engage in useful modeling or benchmarking to explore the options for improving performance, and so forth. As I stressed in my last column, you are in the deepest trouble when you don't have any expression of what your workload might look like. Then nobody can employ a model, a benchmark, a prototype, or anything else to tell you how your system might perform. Using Alternative ScenariosPeople sometimes consider an effort to define a workload such as I have described--or to define other requirements--then quickly despair because they face so many uncertainties: We don't know how many customers will connect over the net for self service, we don't know which services will be most valuable to them, we don't even know how many customers we will have for our new product line, and so on. This is life. It is this way on every project I have ever worked on--and I have helped a lot of customers with a lot of VLDB projects. Here is the answer: Construct a small number of reasonable alternative scenarios and create workload projections for each of them. Then pick out a few that seem to represent the most important or most likely points on the spectrum and use those to drive your planning. Make reasonable assumptions. Be aggressive about getting the information to refine them. Then project your workloads and performance requirements and proceed methodically. Improve your estimates every chance you get. That's the best anyone can do. And if you do it, you are much better off than if you don't do any planning. Onward and UpwardThe workload is usually the most difficult of the critical engineering requirements to get. A rough estimate of a workload is much better than no idea. The worst thing is to hang back too long, afraid to guess, and therefore make all your engineering and management decisions with no idea of the workload. That's the sure path to VLDB ruin. But come up with a rough workload estimate and a reasonable idea of what the other critical inputs are, as I listed at the beginning of this article, and you have a basis for rational decision making--and a chance at making good VLDB implementation decisions. Make the best rough blueprint you can of your workload, using a structure such as I illustrate in Tables 1 and 2, refine it as quickly as you can, and use it as an input to your performance and scaling decisions. You will be one step closer to data warehouse success. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||