Winter Corporation


WinterCorp

411 Waverley Oaks Road

Waltham, MA 02452

Phone: 781.642.0300

Fax: 781.642.7222

 

Contact us | Privacy

      

Winter Corporation VLDB News

DB2 Magazine
Q1 2002

Playing Every Tune: Federated Databases

Skeptical of the high price tag and long lead times associated with centalized data warehouses? Federated databases let you query data wherever it lives, whatever the format.

By Richard Winter

"Let's get all the data we need about our customers and bring it together in one place." That's one of the most common project concepts you'll find in large companies today.

When a company sets out to build the decision-support or customer relationship management (CRM) system that will make all customer data accessible, most people picture a scenario that works like this: A user enters a query about a customer. That query goes to one computer where all the customer information has been stored in a single database. The system searches that one database and returns the answer to the user.

But it doesn't have to be that way. The customer data is probably already around somewhere in a company's existing systems. Instead of duplicating that data in a new centralized system, the company could leave the existing data in place. Then, when the user asks a question, the system gathers the appropriate data from wherever it is stored and answers the question. This approach — leaving the data in place and retrieving it on demand — is called a federated database.

Until recently, the centralized approach represented the more attainable of two limited choices for building CRM systems: a massive initial investment of time and money to build and load a centralized database or a complicated patching together of existing systems with all the associated integration and maintenance problems.

But advances in technology and product capabilities have made the federated approach a practical option, and one that more and more companies are likely to choose.

DB2 Universal Database (DB2 UDB) contains a remarkably complete set of capabilities to support federated databases, which I'll summarize. First, let's look at how a federated database works and the trade-offs between federated and centralized approaches.

WHAT IS A FEDERATED DATABASE?

A federated database is a collection of data stored on multiple autonomous computing systems connected by a network that is presented to users as one integrated database.

A federated database presents data to users by means of user views that, from a user perspective, look exactly like the views of data in a centralized database. User views in a federated database are mapped to underlying tables or objects that may be stored on any of the various databases in the federation. The federated database system processes queries defined in terms of these views, retrieving data as needed from the other systems in the federation, and delivering the results as if all the data was local.

Anyone who uses the Internet takes the delivery of content from multiple systems for granted. Almost every Web page on a major commercial site is assembled automatically from multiple sources. Click on a button and data may be retrieved from other databases and servers. Users may not be aware of this, but it goes on constantly.

Federated databases operate by means of a similar principle, except that each data resource is defined by means of a database schema or view, and the user has much more power to access and manipulate the data. Users interact with a front-end system that presents information on what data is available and accepts queries. The front-end system decomposes each query into subqueries that can be handled by the various database servers in the federation, ships each subquery to the appropriate database server, and assembles and delivers the result to the user.

Note that a federated database system will deliver acceptable performance for most applications only if it includes a fully developed cost-based optimizer that is aware of the distribution and heterogeneity of the back-end servers. Some older products that appear to deliver federated database management are really only providing gateways to other databases. Such products may provide command and data translation, but they aren't capable of handling most queries that access data at multiple locations.

Both the servers and the data models in the federation may be heterogeneous. The federation can include different database engines residing on different operating systems and hardware platforms. (For example, a federation may include DB2 databases on Sun Microsystems and IBM servers, Informix databases on Hewlett-Packard servers, Microsoft SQL Server on a Unisys ES7000, and a Teradata database on an NCR Worldmark 5250.) Database servers can be added to and deleted from the federation over time.

In terms of data models, the front-end system may present users with a relational view while retrieving data from database servers that are object oriented, hierarchical, and so on. Different users or applications may be presented with different views of the data suited to particular uses.

Multiple copies of the front end can be distributed throughout the network. Front-end systems may maintain snapshots (materialized views or summaries) of data in various servers. Such snapshots have several advantages, including accelerated query performance. And the snapshot can be available even when the underlying data is not. Snapshots also provide a stable picture of a dynamic phenomenon for analysis and reporting purposes.

FEDERATED VS. CENTRALIZED DATA WAREHOUSING

In principle, any multisource database can be implemented in either a federated or a centralized architecture.

A major obstacle in the centralized approach is that it requires a massive investment before any benefit is realized. The usual process includes:

  • Developing a concept and data model for the collection of data needed to support business decisions
  • Identifying useful data sources from which to obtain the data
  • Acquiring a database server platform to the host the data warehouse
  • Extracting data from the sources
  • Cleaning, correcting, and transforming the data to common formats and semantics
  • Designing an integrated database to hold the data
  • Loading data into the data warehouse
  • Developing systems to repeat most of this process periodically to keep the data current.

Only after these eight expensive, time-consuming steps occur can you start implementing the applications and processes that will realize value from the investment. The sobering part of this proposition is that you invest the millions of dollars in building the centralized data warehouse before you can answer one single business question.

With the federated approach, the economics are different and can be extremely advantageous. You don't buy the massive database server and you leave all your data where it is. You don't need to build systems to update the database. You still perform the first three steps, but the integrated database is virtual. It exists only in the form of the views presented to the users. When a user submits a request for data, the federated database system performs extraction, cleansing and transformation, and the equivalent of loading for only that data needed to satisfy the user's request.

Instead of a massive database server, you acquire a midrange server or several smaller servers to host the front end. Instead of creating large batch extract, transform, and load processes for periodic update, you write data definitions (which can include stored procedures) to map data in the federated servers to the integrated views that will be presented to users and applications.

Suppose you have 50 terabytes of data scattered over 25 different systems. Once a week you have a question that requires data from several of those sources. With the centralized approach, you'd make a huge investment to bring it all together ahead of time — and it might never pay off. It would be more efficient to simply get the data when you need it.

With the federated approach, you can answer a question about one customer by doing the work on just that one customer's data. So, although there is some up-front investment, a major portion of the cost is incurred one query at a time.

Now, there are some trade-offs. Once the centralized data warehouse is built, it is far more efficient. If you are going to query and analyze the data frequently or intensely enough, the centralized data warehouse will be more economical in the long run. The centralized data warehouse is far simpler to manage and operate because the data is all in one central system. And, the centralized data warehouse has a performance advantage that ranges from modest to great depending on the query.

But the federated approach has some unique advantages. You use the data in place rather than replicating it in a central system, so it's is never out of date or synchrony. And, you don't have wait until data on all customers has been extracted, transformed, and loaded before you can start doing useful work.

Of course, federated databases have some disadvantages, too. Some queries are inherently costly (because they involve moving large amounts of data between systems) or difficult to optimize. And the federated systems are presumably not all under the control of the data warehouse owner, so they may give federated decision-support requests a lower priority than other tasks.

DB2 AND FEDERATED DATABASES

DB2 UDB contains some features that enhance the value of federated systems. The DB2 engine itself serves as the front end for a federated implementation. DB2 interfaces with users and applications and provides views over underlying data whether it's stored in a local DB2, a remote DB2, or another database system (including Oracle, Informix, Sybase, Microsoft SQL Server, and Teradata). The technology that makes this possible, called DB2 Relational Connect, represents a subset of the full functions available in DB2 DataJoiner, a separate product. As an optional feature of DB2 UDB, DB2 Relational Connect provides read-only support for distributed heterogeneous queries. DB2 DataJoiner provides complete heterogeneous data management, with INSERT, UPDATE, and DELETE added to query capabilities.

DB2 with DB2 Relational Connect provides major capabilities for the implementation and operation of federated databases, including:

  • Optimization of queries that involve access to distributed, heterogeneous database servers
  • System maintained summaries — via Automatic Summary Tables (ASTs) — of remote data on the front-end DB2 system. ASTs are ordinarily not visible to the user. When a user writes queries in terms of views defined on the underlying tables, DB2 automatically goes to the AST on the front-end server, rather than accessing the federated back-end systems, whenever it determines that using the AST is more efficient. ASTs boost performance without complicating the user view.

These sophisticated performance capabilities are crucial to making federated databases practical. Large-scale federated databases in production use have been showing up in the Winter Database Scalability Program survey for several years. About 5 percent of respondents to the most recent survey were in production with large-scale federated databases. Transaction processing systems were more common than data warehouses, but both were present. In our most recent survey, one system was managing several terabytes of data in a federated DB2 data warehouse distributed over IBM and Sun servers.

STATE OF PRACTICE

The Internet has made large federations of content servers widespread. Distributed application architectures built around Web servers and many cooperating databases are becoming common within and between enterprises. Federated databases are another step forward in these distributed architectures: They offer the full power of SQL while hiding distribution and heterogeneity from users and applications.

Because federated databases require less up-front investment and less data duplication than the centralized approach, they provide data integration with "pay as you go" economics. In the quest to advance business goals via data integration, companies will increasingly turn to federated database architectures.