Winter Corporation


WinterCorp

411 Waverley Oaks Road

Waltham, MA 02452

Phone: 781.642.0300

Fax: 781.642.7222

 

Contact us | Privacy

      

Winter Corporation VLDB News

Intelligent Enterprise
June 1999

Check Your Values

Richard Winter


Standard query optimization algorithms don’t consider whether data is skewed. In an environment with high data volume, queries on skewed data that aren’t correctly optimized can greatly tax system resources, not to mention user patience. If you have skewed data and don’t have time to wait for business-critical information, you need to understand how skew affects query performance and how to select a database platform that meets your requirements.

In a relational database, data is “skewed” when the distinct values in a column are not uniformly distributed over the table’s rows. For example, suppose you have a table of 10 million customers and a column called Country. Your business is based in the United States and 60 percent of your customers are domestic, 10 percent are in Canada, 2 percent are in France, and the other 28 percent are in 22 other countries. If the data wasn’t skewed, each of the 25 countries would have exactly four percent of the customers. Because the percentage varies by country, the Country column in this table is skewed.

Skew is a common phenomenon in databases, partly because of the Pareto principle, known more widely in business as the 80/20 rule. The more general statement of this principle is “A minority of the elements accounts for a majority of the value.” Thus, a small percentage of the products accounts for most of the complaints, a small percentage of the customers accounts for most of the sales, and so on. It’s also important to evaluate queries that select sets of rows from skewed columns. Database products differ significantly in how they deal with skew.

Skew in Predicate Evaluation

Predicate evaluation is the act of selecting from a table the rows that satisfy given criteria. In SQL, a single table predicate appears in the WHERE clause and consists of criteria on columns combined with AND, OR, NOT, and so on. An example of a simple predicate in English is “female customers in the United States who own a home.”

To understand the implications of skew, let’s review an example. Suppose your Customer table of 10 million customers is stored 20 rows per page; storing it, therefore, takes 500,000 pages. Remember that 60 percent of customers are in the United States and 2 percent are in France. In this hypothetical table, 50 percent of customers are female and 20 percent of the females own a home.

The predicate “female customers in the United States who own a home” would select six percent (.6*.5*.2) of the customers, or 600,000 rows (assuming the values in the Gender, Country, and Homeownership columns are independent—a subject for another day). By contrast, “female customers in France who own a home” selects .02 percent (.02*.5*.2) of the customers, or 20,000 rows.

If you think of the values in this query as variables, the selection expressions are identical. That is, in SQL, the WHERE clause would be:

WHERE Country=#c#
AND Gender=female AND Homeowner=yes

Thus, only the value of Country changes from one case to the next—the structure of the WHERE clause remains constant. In fact, such a WHERE clause, or predicate, could appear in a stored query, in which the end user supplies only the country name. However, in one case (United States) the predicate selects 600,000 rows and in the other (France) it selects 20,000 rows.

Now think what happens when the user wants the average income of the customers who fit this profile. Recall that the customer table occupies 500,000 pages on disk. In the case of the 600,000 U.S. female homeowners, the selected rows make up, on average, 1.2 of the rows per page. The best strategy is a full table scan of all the pages.

But when it comes to the 20,000 French female homeowners, the query selects far less than one row per page. In fact, only 20,000 out of 500,000 pages could possibly contain a qualifying row. In this case, assuming the necessary indexes exist and can be combined by the optimizer, it is far more efficient to do an index scan, reading 25 times fewer pages than with a full table scan.

So now you can see the nub of the problem with skews: The optimal query plan depends on the choice of values in the query. Query optimization plans in most commercial database products are sensitive to only the choice of columns. Furthermore, most query optimizers don’t collect or use the statistics that make them sensitive to the implications of values within a column. In fact, in most products, the only statistic retained is the number of distinct values in the column (the “column cardinality”). In this situation, the optimizer assumes that no skew is present—that is, that the values are uniformly distributed over the table’s rows.

As you can see from the example, this assumption of uniform distribution can be very wrong. Scanning a full table to retrieve the rows for French female homeowners results in 500,000 disk reads rather than 20,000. Assuming 50 reads per drive per second, this scan represents 9,600 additional drive-seconds of work. Thus, in a system doing single threaded I/O, this scan prolongs query time by 160 minutes. Even if query I/O is done 10 ways in parallel, it adds 16 minutes to the query time—changing it from an interactive query that completes in a few seconds to one that calls for an extended coffee break while waiting for a response. What’s worse, from a system point of view, if the optimizer fails to take skew into account, it occupies the disks unnecessarily for hours.

This issue is critical to major database scalability for the following reason: It doesn’t make much difference when tables are small, and unnecessary full table scans are completed in a few seconds. It matters a great deal, however, when tables are large. What’s more, even though the relationship between increasing table size and increasing query response time is linear in my example, the effect on system work queues can easily become worse than linear in a busy, multiuser environment.

 

Skew in Join Performance

Skew influences join performance in a similar way. Join performance also affects many common queries and is more difficult to characterize briefly, because of the complexity of join processing.

Here’s an example: Suppose you have a billion rows in a transaction table stored with your customer table. Transactions are stored 50 per page, resulting in 20 million stored disk pages. Each transaction row includes a customer ID and product ID. In the transaction table, the product column is skewed. For example, there are more transactions for televisions (20 million) than there are for alarm systems (5,000).

Now consider two example joins. If you ask for the average income of U.S. customers who bought television sets, the system needs to join six million customer rows to 20 million transaction rows. This operation requires the system to read nearly every page of both tables, which a good optimizer will choose to do before probably selecting a sort-merge join.

If you ask for the average income of French customers who bought alarm systems, the optimizer should realize that there are 5,000 alarm systems and 200,000 French customers. Assuming that the product column in the transaction table is indexed, it should probably perform a nested loop join, driving the join from the transaction side. The particular join choice will vary with other factors that are too involved for this short discussion. Some products now process such a case with a join index, which changes the considerations. But, regardless of the join techniques available, the system is susceptible to a large adverse effect if the optimizer is insensitive to skew. In most database products, failure to optimize for the skew with tables the size of my Customers example would result in several million otherwise unnecessary disk reads.

As illustrated in the preceding examples, proper handling of data skew can spell the difference between success and failure in a large database project, particularly a large data warehouse project. If you sense that you may be critically dependent on such a capability or if your database is about to get much larger, be sure to consider data skew as early in the project as you can (see the sidebar “Get Skew Wise”). If you don’t have in-house expertise to understand, assess, and quantitatively resolve issues related to data skew, you should get expert outside assistance. Nobody can afford to do millions unnecessary disk reads for common queries. Nobody can afford to maintain indexes infrequently used because of skew effects. And few can afford to ignore the implications of skew in a large data warehouse.
 

Get Skew Wise

How should you take into account the skew issue in your database projects?

1. First, recognize that it is principally a large database issue; if your largest table contains fewer than half a million rows, you may not have to deal with it directly in your project.

2. Work out some examples in your application to understand whether common queries are significantly affected by the likely degree of skew. Some applications don’t suffer when data is skewed. Data warehouses are not usually among them.

3. If your database is substantial and you can readily see how skew will affect queries important to your users, take some time to define your requirements. Define some common query classes to which skew is important, estimate their frequency in your workload, define your performance requirements, and estimate the distribution of values in your data.

4. If you are selecting a platform for your database, investigate how the candidates handle skew and understand what statistics they keep by column. Also discover when the optimizer takes these statistics into account and how much ability it has to correlate distributions among columns.

5. If your project involves a critical performance or scaling challenge, you need to do some measurement early in the process. Create a benchmark or proof-of-concept and make sure it seriously incorporates the data skew issue.

6. Consider skew in your query performance analysis during all phases of planning, design, and tuning. Analyze the worst likely case.

7. Consider skew in your choice of indexes. The frequency with which an index will be selected and its ability to reduce query time and work are often affected by a column’s values, as well as its cardinality. Unfortunately, much database design advice ignores this point.

Although not mentioned in the preceding examples, skew on a partitioning key is also a common problem that’s much more familiar to most database administrators. Techniques for managing this problem are spelled out in most database engine manuals.