Intelligent Enterprise
Growing Pains
For a long time, more than two decades, nearly all large commercial databases lived on IBM mainframes within a largely predictable world. I remember participating in a product planning meeting in the early 1980s with the top management of a leading database vendor at that time. Someone suggested that we visualize the environment in which our customers would be operating in three years, then plan our product development so we would be in just the right place when they got there. This suggestion echoes Wayne Gretzky, who was once quoted as saying that other players skate to the puck but he skates to where the puck is going to be. The interesting moment in this meeting was when the vice president of marketing responded, "I can tell you what the customers' environment is going to be in three years: It will be almost exactly what it is today." He argued that for large databases, the market his company pursued, nothing much changes in three years. At that time, only IBM defined the world in which large databases were built: It decided the hardware architecture, operating system, and pricing structure. IBM owned the playing field and set the ground rules under which other companies played. And, as the marketing VP said: "If IBM was going to do some major new thing in the next three years, we would already know about it."
That, of course, was before the compounding effect of Moore's Law, commodity economics, open architecture, and a few related factors took over the world of big systems. These days, three years is enough time for the world to change an awful lot. Recall that just three years ago, the world of e-commerce hardly existed, and for most people, intercompany email was a new phenomenon. So what will happen in the world of scalable databases over the next three years? I talked to friends and contacts in the industry (partly via email, of course), mused about this question over a beer, and came to the following conclusions.
Database Size
I have been going on record about this question for about four years now, so I thought it might be interesting to bring you a prediction from Jim Gray of Microsoft: "In three years, every substantial company will be able to afford a terabyte of storage (at $25,000), the midrange will be at 10TB and the high end will be 100TB." By 2001, according to Gray, raw disk will cost a penny per megabyte. Our survey results confirm that 100TB can't be too far away. As Gray says, "Very few vendors seem to grasp the implication of these shifts. Our software base needs a complete overhaul." And when he says "our software base," he means the industry's as a whole, not one particular vendor. I think he's right. Most of today's popular data handling software was designed when a gigabyte was large for a database. Ten years ago, 90 percent of CIOs couldn't imagine why anyone would want to spend the money to store 100 bytes of information about each of a million customers online. "Why not put it on tape?" they would say. "How often are you going to bother searching such an enormous file, anyway?" And 10 years ago, very few relational database products could have handled a query against that million-row, 100MB table efficiently. Within three years, however, we will have databases nearly a million times larger for many of those same products to search and manage.
I'm a believer in the ability of software vendors to adapt their products to significant changes in requirements and technology, given enough time and money. But can anyone really architect a database engine for scale x and then have it work optimally on a database of scale x million? As the tallest peaks in the database Himalayas get exponentially higher each year, I think that fewer database engines will be up to making the climb. (Remember the little red engine that could: "I think I can, I think I can.")
If, as a database vendor, you want your database engine to be good at dealing with 100TB in three years, you'd better have a plan all worked out, and you'd better have a lot of resources in place to execute it now. Perhaps we will see some visible VLDB dropouts in the next three years. Those who stay the course are going to have some tough problems to solve.
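The arithmetic behind these figures is easy to check. The short Python sketch below is only a back-of-the-envelope illustration: it assumes decimal units (1MB = 10^6 bytes, 1TB = 10^12 bytes), and the helper name `raw_disk_cost` is made up for this example.

```python
# Back-of-the-envelope check of the sizes and prices quoted above.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 TB = 10**12 bytes).
MB = 10**6
TB = 10**12

# "100 bytes of information about each of a million customers"
customers = 1_000_000
bytes_per_customer = 100
table_bytes = customers * bytes_per_customer

# High-end database size predicted for three years out.
high_end_bytes = 100 * TB
growth_factor = high_end_bytes // table_bytes


def raw_disk_cost(size_bytes: int, dollars_per_mb: float = 0.01) -> float:
    """Raw disk cost in dollars at a given price per megabyte (hypothetical helper)."""
    return size_bytes / MB * dollars_per_mb


if __name__ == "__main__":
    print(f"Customer table: {table_bytes // MB} MB")
    print(f"Growth to 100TB: {growth_factor:,}x")
    for size_tb in (1, 10, 100):
        print(f"{size_tb:>3} TB of raw disk: ${raw_disk_cost(size_tb * TB):,.0f}")
```

The million-customer table works out to 100MB, and 100TB is a million times that. At a penny per megabyte, the raw disk for a terabyte comes to about $10,000, which sits comfortably under the $25,000 Gray quotes for a terabyte of storage.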
Parallel Architecture
In this world of increasingly larger databases, the engine has to be designed for a much higher degree of parallelism than we see in practice today. As Gray points out, it takes 1.2 days to scan a terabyte at 10MB/sec. We are moving to fiber channel disk systems and incrementally faster devices, but scanning a terabyte of data still takes a very long time if you don't do it in parallel. Today's most highly parallel production systems will do scans in parallel across a few hundred processors. Such highly parallel operations occur on Wal-Mart's NCR Teradata system and a few other places, but most of today's VLDB work involves queries that are being parallelized over a few tens of processors. We need a quantum advance to do a good job on 100TB databases. So far, vendors haven't said anything very convincing (at least not publicly) about how this is going to work well at 10TB, let alone 100. Then, at the low end, where people have a mere terabyte online, virtually everyone will need a parallel database engine. Even the low-end DBMS server products are going to have to learn to do parallel architecture with integrated parallel optimization well.
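Gray's scan-time figure, and what parallelism does to it, can be reproduced with a few lines of Python. This is a rough sketch under loudly idealized assumptions: decimal units, every scan stream sustaining the full 10MB/sec, and perfectly linear speedup with no skew or coordination overhead. The function name `scan_days` is invented for the example.

```python
# Idealized scan-time estimate. Assumptions (not from any vendor's numbers):
# decimal units, each stream sustains the full rate, speedup is perfectly linear.
MB = 10**6
TB = 10**12
SECONDS_PER_DAY = 86_400


def scan_days(size_bytes: float, rate_bytes_per_sec: float, streams: int = 1) -> float:
    """Days needed to scan size_bytes using `streams` parallel scans at the given rate."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    total_rate = rate_bytes_per_sec * streams
    return size_bytes / total_rate / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = 10 * MB  # 10MB/sec per stream, the rate Gray uses
    for size_tb in (1, 10, 100):
        for streams in (1, 10, 100, 1000):
            days = scan_days(size_tb * TB, rate, streams)
            print(f"{size_tb:>3} TB, {streams:>4} streams: {days:8.3f} days")
```

A single stream over one terabyte comes out at roughly 1.2 days, matching Gray. Even under these generous assumptions, a 100TB scan spread over 100 streams takes just as long as that single-stream terabyte scan, which is why a few tens of processors will not be enough at the high end; real systems would do worse once skew and contention enter the picture.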
Commodity 64-Bit Architecture and Clustering
In 1999, we'll start to see 64-bit Intel architectures lower the price levels for very fast processors and very large memories, which will mean fast, powerful, inexpensive SMPs and MPPs with memory capacities that are enormous compared to what we use today. At the same time, in the period from 1999 to 2000, larger-scale, inexpensive clustering technology will become widely available. There is the potential here for NT-based servers to enter the arena of much larger, more highly available databases. Unix will also be advancing in these dimensions, but it seems likely that NT solutions will take a big bite out of the VLDB market during the next few years. My guess is that most databases of under 250GB created after 1999 will be on NT platforms.
According to Rob Holbrook, vice president of decision-support solutions at Compaq, the industry will experience dramatic, nonlinear growth in the size of the largest database systems deployable. This will be due to the adoption of 64-bit architectures, which enable the effective use of larger SMP systems and the evolution of clustering technology as a way to tie large numbers of these SMP systems into a single processing complex. Holbrook says that Compaq is going to drive the cost of VLDB implementations down dramatically while enabling larger databases than those that can be built today. Since the Tandem acquisition (and even more so after the Digital deal), I've been expecting Compaq to unfold a scalable database strategy. It will be interesting to see to what extent Compaq integrates Tandem's technology and know-how concerning large, reliable databases into its strategy. ServerNet, NonStop Kernel, and NonStop SQL technology could play a large role as Compaq pursues larger-scale systems.
Ultradense Disk
As if the ordinary disks we've been using for the past decade weren't falling in price enough, a new generation of disk technology is also at hand. Using optically assisted Winchester technology, ultradense disks will store 100GB on a 5.25-inch drive and have a total cost of operation lower than that of tape silos. These extraordinary gains in capacity and cost of operation are achieved with higher density recording in which lasers are employed to support the recording of smaller, more closely spaced bits. Ultradense disks, therefore, provide an attractive option for VLDB backup, replication, archiving, and so on.
Enterprise Storage Networks
In today's world, you can't connect more than about 30 hosts to a shared pool of storage and, using SCSI technology, the maximum distance between a host and a storage device is 25 meters. With a new architecture introduced by EMC as the enterprise storage network, it is now practical to connect hundreds of hosts to a shared pool of devices from as far as 500 meters away. That distance will increase to 10 kilometers in early 1999. Storage networks enable a new approach to widespread sharing of large volumes of storage and, by implication, large amounts of data. Smart storage networks are going to take over more of the job of managing data and enable new enterprise data management architectures. Leading user organizations are moving toward enterprise storage strategies and enterprise data architectures in which the total collection of data in the enterprise (all types of files and all types of data) is viewed and managed as one enterprise resource. These strategies contribute to a new emphasis on enterprise management of metadata, business rules, data transformation, data quality, and many related issues.
Enterprise Data Management
Oddly enough, it seems that the notion of the integrated corporate database, the holy grail of the late '70s and early '80s, is nearly forgotten today. Many forces have operated to fragment the data across a large enterprise over the past decade: the success of client/server architecture, the adoption of off-the-shelf applications with their own databases, the continued dependence on large numbers of stovepipe legacy systems, the proliferation of independent data marts, and the availability of cheap storage. The data of most enterprises is scattered over countless servers and systems, resulting in a continually growing obstacle to data integration. The need for integrated data for decision support has driven the data warehousing boom, but the data warehouse only solves part of the problem. Companies need to integrate their operational data to implement the strategies enabled by their data warehouses. The industry hasn't offered solutions for the integration of operational data, but integrated operational data is going to get increasingly important over the next three years. Enterprise storage networks address some key issues in access, physical data sharing, and storage management, but the next step to genuine data integration is a big one over which many companies will stumble if vendors don't begin to develop off-the-shelf solutions.
Multimedia Databases
All my friends in the vendor community, both hardware and software types, tell me that those 100TB databases are going to have a lot of textual, image, and other multimedia content. Jim Gray argues that you'll have a hard time finding even 100TB of transactional data in most enterprises. But I'm betting that we'll continue seeing primarily tabular data for the next three years. Sure, we will see more and more text, image, and other interesting new stuff integrated into databases; and that phenomenon is worth watching. But most of my clients don't seem to be racing to put their images right inside their giant databases. What I see them doing instead is linking image data to structured databases, building integrated applications to make it look seamless, and coming up with more tabular data than you might ever imagine.
The hotter area this year, it seems to me, is mining the total history of customer interaction with the company. That is, companies are going beyond the transaction history to include other information about the customer's interaction with the company: information about every customer inquiry, every interaction with customer service, every page visited in an online store, every key pressed using the automated voice response system, and so on. You can rack up a lot of terabytes with that stuff, even before you get to recording and searching the voice records of the calls to the call center.
Of course, multimedia databases are also an exciting development and seem to be at the center of endlessly fascinating applications. The complete medical record includes X-rays, MRI scans, CAT scans, EKGs, notes dictated by doctors, and the list goes on. When these things are all integrated into a person's lifetime medical record, it's going to be a whopping big database. But typically, the focus seems to be on the more fundamental issues of data integration and data quality as well as successful management within the database of the structured data.
It will be interesting to see whether the maturation of object/relational technology over the next three years will result in many multimedia implementations for large-scale systems. The capabilities database vendors are creating in relational extenders, cartridges, and DataBlades for image data are intriguing. Most people in the VLDB arena, however, need to be convinced that these things can scale.
Scalable Databases in 2001
In just three short years, we have the prospect of yet another quantum leap in storage technology, raw disk at a penny a megabyte, and terabyte databases nearly everywhere. We have the challenges of managing them, and we hope the database vendors will be ready. We should see at least a few massive systems handling a hundred terabytes or more and virtually universal use of parallel database architecture. These advances do indeed seem to suggest a qualitatively different world from the one in which we live today. If you are involved with scalable databases, it might be a good idea to take a long vacation right after you get past that Year 2000 crisis. There's going to be a lot to do when you get back.