Winter Corporation


WinterCorp

411 Waverley Oaks Road

Waltham, MA 02452

Phone: 781.642.0300

Fax: 781.642.7222

 

Contact us | Privacy

      

Winter Corporation VLDB News

Intelligent Enterprise
January 1999

Indexing Goes a New Direction

Richard Winter

IBM has invented and patented a significant new data structure called the encoded vector index (EVI), which promises major gains in the evaluation of predicates on large tables in DB2. This feature increases the performance and scalability of the large data warehouse in DB2, doing a better job than uncompressed, stored bitmaps would on some queries.

IBM (www.ibm.com) delivered EVI in DB2 for OS/400 version 4 release 3 in August. I expect to see it in DB2 for AIX, Windows NT, and OS/390 in the next 12 to 18 months.

FIGURE 1 Concept of a bitmap index (uncompressed).

You can think of the EVI as a variation on the bitmap index concept. As shown in Figure 1 (page 72), a bitmap index indicates whether a specific value exists for each row in a particular column. One bit represents each row. Thus, in the bitmap index for the value blue in the column color, the nth bit equals 1 if the nth row of the data table contains color = blue, or 0 if that row holds a value other than blue.

FIGURE 2 An encoded vector index (EVI) for the color column.

An EVI serves a similar purpose, but only one index is necessary to account for all the values occurring in the column, as shown in Figure 2 (page 72). Each position in the EVI corresponds to one cell on the indexed column. So, in an EVI on the color column, the nth position of the EVI identifies the value of color in the nth row of the table.

The system stores a bit code in each position on the EVI. In principle, the length of that bit code is just sufficient to represent all the values uniquely. Thus, if seven distinct values occur in the column, they would be encoded in strings of three bits. To represent these seven values, the system would choose encodings such as 001, 010, 011, and so on, up through 111.

Suppose you have a table about cars with a row for every car registered in California — a total of, say, 10 million cars. Given a row size of 500 bytes, the result is a medium-sized table of 5GB. The table has columns for attributes such as manufacturer, model, year, and color.

Let’s suppose the manufacturer column contains 200 distinct values. The system, then, would assign an 8-bit code to each value. Thus, an EVI for the manufacturer column of our example cars table is 10 million bytes long, with a one-byte position for each row. The same data represented in the traditional DB2 index, referred to as a radix, would require about four times as much space.

Also stored with the EVI is a lookup table. The lookup table maps the relationship between column values and codes and records the frequency of each value and the row identifier (RID) of the first and last occurrence of each value.

EVIs in Queries

The query “How many cars are blue?” illustrates the simplest use of the EVI structure: The query engine retrieves the answer from the EVI lookup table, presumably with a single disk access.

To answer the query “List the Studebakers,” the system uses the EVI vector itself. It retrieves the code for Studebaker from the lookup table and scans the EVI for matches. The position of each match corresponds to the RID of a row containing Studebaker in the manufacturer column. Because the vector is compact, such a scan is usually faster than a conventional index scan.

The method works equally well for range conditions. The system uses the lookup table to identify the codes of the values that fall within the range. Then a single scan of the EVI is sufficient to get the corresponding RIDs.

Multicolumn Predicates.

Previous versions of DB2 already incorporate dynamic bitmaps to increase the efficiency of evaluating predicates on multiple columns. EVIs thus become an adjunct to the use of these dynamic bitmaps in queries such as, “How many cars are blue or navy and made by Ford, General Motors, or Chrysler between 1950 and 1995?”

FIGURE 3 Evaluation of a multicolumn predicate with EVI.

To answer this query using our example EVIs for color, manufacturer, and year, the system would proceed, as illustrated in Figure 3 (page 73), through the following algorithm:

  1. Read each EVI, converting it to a bitmap representing the matching value or values for that column. Thus, we get bitmaps for:
    • manufacturer = Ford or General Motors or Chrysler;
    • color = blue or navy; and,
    • year is between 1950 and 1995.
  2. AND the three bitmaps to produce the bitmap representing the result of the query. The query engine counts the 1 bits in the resulting bitmap to produce the answer.

The predominant cost in evaluating this query is that of reading the three EVIs, which we can estimate roughly at 10MB each (if between 128 and 255 values exist in each of the columns color, manufacturer, and year). Thus, the system must read 30MB of EVI.

Conventional Optimization Strategy.

Several database engines use a maximum of one index per table in evaluating a query. The optimizer asks whether the most selective index will eliminate enough data blocks to make an index scan more efficient than a full table scan.In our example, none of the individual column conditions is selective enough, so these engines would simply do the full table scan, resulting in 5GB of disk I/O. Such heavy disk access is 150 times slower than the EVI-based technique I described.

Combining Radix Indexes.

Previous versions of DB2 on the AS/400 as well as current versions of DB2 on AIX, NT, and OS/390 would use — and combine — radix indexes for this query. With the radix index, the query engine constructs a list of RIDs for each value of each indexed column. Thus, if the table contains one million Fords, the radix index will contain one million RIDs for manufacturer = Ford. Let’s postulate also roughly one million cars each made by General Motors and Chrysler. If we further posit about one million blue cars, one million navy cars, and seven million cars made between 1950 and 1995, we get a total of 12 million RIDs in the three indexes to be combined. Using the AS/400 RID size, 4 bytes, we get 4312 = 48 million bytes of radix indexes to process — about 1.6 times the disk I/O required to do the job with EVIs.

Stored Bitmap Indexes.

A bitmap index would exist for each color, each manufacturer, and each year from 1950 through 1995. The result is 50 bitmaps of 1.25MB each — about 63MB if they are not compressed (as with Informix or Red Brick) and considerably less if they are compressed (as with Oracle, Model 204, or Sybase IQ). Uncompressed bitmaps thus require about twice the I/O of EVIs in this example.

Efficiency Comparisons.

It is easy to construct examples in which EVIs would be less efficient than even uncompressed stored bitmaps. For example, if you want to count just the blue Fords in the database described previously, the database engine would need to read 20MB of EVIs but only 2.5MB of bitmaps. This 8-to-1 advantage in favor of the bitmap index is significant.

However, as the number of values per column involved in the query increases, this margin diminishes and eventually tilts toward the EVI. Range conditions require a bitmap reading for each value spanned by the range. So, the condition “between 1950 and 1995” implies 46 values and therefore 46 bitmaps. The EVI for year of manufacture would use at most a byte per position. Reading an EVI of default width is roughly equivalent to reading eight bitmaps. For the year column in this query, then, EVI processing requires roughly 8/46 of the disk I/O required by bitmap processing. In addition, one scan of the EVI is probably more processor-efficient than ORing 46 bitmaps. Compressed bitmaps would require less I/O but more processing. At this point, I see no simple way of comparing their behavior to EVIs.

Maintenance and Storage

In the current implementation, EVIs are eight bits wide by default. Presumably this means they can represent up to 255 values plus the NULL value. Data Definition Language (DDL) permits you to specify the maximum number of values allowed in the index, so that you can choose widths other than eight bits. For example, in a field that contains at most eight distinct values, the EVI would need to be only three bits wide.

One undesirable side effect of EVIs may arise when the diversity of values in a column increases. Given enough new values, the system must delete the EVI and rebuild it, creating codes with more bits at every position on the vector. This maintenance issue doesn’t exist for bitmap indexes; new values simply result in the creation of a new bitmap without affecting existing ones.

Still, uncompressed bitmaps occupy much more storage than the corresponding EVIs for fields with more than a few distinct values. For a field with 255 distinct values, as in our example cars database, the EVI requires 10MB of storage. The corresponding uncompressed bitmaps, however, require 25531.25MB, which equals about 319MB of storage — just under 32 times as much. (Bear in mind that Oracle, Sybase IQ, and Model 204 compress bitmaps quite dramatically and would do much better.)

Prognosis

These are early days for EVIs. A recent invention, only one commercially available product — DB2 for OS/400 — has implemented them, and they are not yet widely used.

Still, it looks as if EVIs will have a substantial, positive effect on the accessibility of data in large data warehouses. They benefit performance about as well as uncompressed stored bitmaps do, yet appear to be useful on a somewhat broader class of queries. Like bitmap indexes, EVIs will be many times more efficient than table scans and other traditional relational database techniques for many queries of practical importance. As with bitmap indexes, it is easy to construct practical examples in which EVIs would outperform traditional techniques by a factor of 10, 100, or even more.

To DB2 users, EVIs will represent a significant advance in performance — and one that changes the way DB2 data warehouses are designed — for the better.