|
|
Intelligent Enterprise Check
Your Values
In a relational database, data is skewed when the distinct values in a column are not uniformly distributed over the tables rows. For example, suppose you have a table of 10 million customers and a column called Country. Your business is based in the United States and 60 percent of your customers are domestic, 10 percent are in Canada, 2 percent are in France, and the other 28 percent are in 22 other countries. If the data wasnt skewed, each of the 25 countries would have exactly four percent of the customers. Because the percentage varies by country, the Country column in this table is skewed. Skew is a common phenomenon in databases, partly because of
the Pareto principle, known more widely in business as the 80/20
rule. The more general statement of this principle is A
minority of the elements accounts for a majority of the value.
Thus, a small percentage of the products accounts for most of
the complaints, a small percentage of the customers accounts
for most of the sales, and so on. Its also important to
evaluate queries that select sets of rows from skewed columns.
Database products differ significantly in how they deal with
skew. Skew in Predicate EvaluationPredicate evaluation is the act of selecting from a table
the rows that satisfy given criteria. In SQL, a single table
predicate appears in the To understand the implications of skew, lets review an example. Suppose your Customer table of 10 million customers is stored 20 rows per page; storing it, therefore, takes 500,000 pages. Remember that 60 percent of customers are in the United States and 2 percent are in France. In this hypothetical table, 50 percent of customers are female and 20 percent of the females own a home. The predicate female customers in the United States who own a home would select six percent (.6*.5*.2) of the customers, or 600,000 rows (assuming the values in the Gender, Country, and Homeownership columns are independenta subject for another day). By contrast, female customers in France who own a home selects .02 percent (.02*.5*.2) of the customers, or 20,000 rows. If you think of the values in this query as variables, the
selection expressions are identical. That is, in SQL, the WHERE Country=#c# AND Gender=female AND Homeowner=yes Thus, only the value of Country changes from one case to the nextthe structure of the WHERE clause remains constant. In fact, such a WHERE clause, or predicate, could appear in a stored query, in which the end user supplies only the country name. However, in one case (United States) the predicate selects 600,000 rows and in the other (France) it selects 20,000 rows. Now think what happens when the user wants the average income of the customers who fit this profile. Recall that the customer table occupies 500,000 pages on disk. In the case of the 600,000 U.S. female homeowners, the selected rows make up, on average, 1.2 of the rows per page. The best strategy is a full table scan of all the pages. But when it comes to the 20,000 French female homeowners, the query selects far less than one row per page. In fact, only 20,000 out of 500,000 pages could possibly contain a qualifying row. In this case, assuming the necessary indexes exist and can be combined by the optimizer, it is far more efficient to do an index scan, reading 25 times fewer pages than with a full table scan. So now you can see the nub of the problem with skews: The optimal query plan depends on the choice of values in the query. Query optimization plans in most commercial database products are sensitive to only the choice of columns. Furthermore, most query optimizers dont collect or use the statistics that make them sensitive to the implications of values within a column. In fact, in most products, the only statistic retained is the number of distinct values in the column (the column cardinality). In this situation, the optimizer assumes that no skew is presentthat is, that the values are uniformly distributed over the tables rows. As you can see from the example, this assumption of uniform distribution can be very wrong. Scanning a full table to retrieve the rows for French female homeowners results in 500,000 disk reads rather than 20,000. Assuming 50 reads per drive per second, this scan represents 9,600 additional drive-seconds of work. Thus, in a system doing single threaded I/O, this scan prolongs query time by 160 minutes. Even if query I/O is done 10 ways in parallel, it adds 16 minutes to the query timechanging it from an interactive query that completes in a few seconds to one that calls for an extended coffee break while waiting for a response. Whats worse, from a system point of view, if the optimizer fails to take skew into account, it occupies the disks unnecessarily for hours. This issue is critical to major database scalability for the following reason: It doesnt make much difference when tables are small, and unnecessary full table scans are completed in a few seconds. It matters a great deal, however, when tables are large. Whats more, even though the relationship between increasing table size and increasing query response time is linear in my example, the effect on system work queues can easily become worse than linear in a busy, multiuser environment.
Skew in Join PerformanceSkew influences join performance in a similar way. Join performance also affects many common queries and is more difficult to characterize briefly, because of the complexity of join processing. Heres an example: Suppose you have a billion rows in a transaction table stored with your customer table. Transactions are stored 50 per page, resulting in 20 million stored disk pages. Each transaction row includes a customer ID and product ID. In the transaction table, the product column is skewed. For example, there are more transactions for televisions (20 million) than there are for alarm systems (5,000). Now consider two example joins. If you ask for the average income of U.S. customers who bought television sets, the system needs to join six million customer rows to 20 million transaction rows. This operation requires the system to read nearly every page of both tables, which a good optimizer will choose to do before probably selecting a sort-merge join. If you ask for the average income of French customers who bought alarm systems, the optimizer should realize that there are 5,000 alarm systems and 200,000 French customers. Assuming that the product column in the transaction table is indexed, it should probably perform a nested loop join, driving the join from the transaction side. The particular join choice will vary with other factors that are too involved for this short discussion. Some products now process such a case with a join index, which changes the considerations. But, regardless of the join techniques available, the system is susceptible to a large adverse effect if the optimizer is insensitive to skew. In most database products, failure to optimize for the skew with tables the size of my Customers example would result in several million otherwise unnecessary disk reads. As illustrated in the preceding examples, proper handling
of data skew can spell the difference between success and failure
in a large database project, particularly a large data warehouse
project. If you sense that you may be critically dependent on
such a capability or if your database is about to get much larger,
be sure to consider data skew as early in the project as you
can (see the sidebar Get Skew Wise). If you dont
have in-house expertise to understand, assess, and quantitatively
resolve issues related to data skew, you should get expert outside
assistance. Nobody can afford to do millions unnecessary disk
reads for common queries. Nobody can afford to maintain indexes
infrequently used because of skew effects. And few can afford
to ignore the implications of skew in a large data warehouse.
|