The Data Warehouse ETL Toolkit

cover-(187x247)Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data.

The Extract-Transform-Load (ETL) system is the foundation of the data warehouse. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions. This book is organized around these four steps.download-buttons-75

The ETL system makes or breaks the data warehouse. Although building the ETL system is a back room activity that is not very visible to end users, it easily consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse.

The ETL system adds significant value to data. It is far more than plumbing for getting data out of source systems and into the data warehouse. Specifically, the ETL system:

  • Removes mistakes and corrects missing data
  • Provides documented measures of confidence in data
  • Captures the flow of transactional data for safekeeping
  • Adjusts data from multiple sources to be used together
  • Structures data to be usable by end-user tools

ETL is both a simple and a complicated subject. Almost everyone understands the basic mission of the ETL system: to get data out of the source and load it into the data warehouse. And most observers are increasingly appreciating the need to clean and transform data along the way. So much for the simple view. It is a fact of life that the next step in the design of the ETL system breaks into a thousand little subcases, depending on your own weird data sources, business rules, existing software, and unusual destination-reporting applications. The challenge for all of us is to tolerate the thousand little subcases but to keep perspective on the simple overall mission of the ETL system. Please judge this book by how well we meet this challenge!

The DataWarehouse ETL Toolkit is a practical guide for building successful ETL systems. This book is not a survey of all possible approaches! Rather, we build on a set of consistent techniques for delivery of dimensional data. Dimensional modeling has proven to be the most predictable and cost effective approach to building data warehouses. At the same time, because the dimensional structures are the same across many data warehouses, we can count on reusing code modules and specific development logic.

This book is a roadmap for planning, designing, building, and running the back room of a data warehouse.We expand the traditional ETL steps of extract, transform, and load into the more actionable steps of extract, clean, conform, and deliver, although we resist the temptation to change ETL into ECCD!

In this book, you’ll learn to:

  • Plan and design your ETL system
  • Choose the appropriate architecture from the many possible choices
  • Manage the implementation
  • Manage the day-to-day operations
  • Build the development/test/production suite of ETL processes
  • Understand the tradeoffs of various back-room data structures, including flat files, normalized schemas, XML schemas, and star join (dimensional) schemas
  • Analyze and extract source data
  • Build a comprehensive data-cleaning subsystem
  • Structure data into dimensional schemas for the most effective delivery to end users, business-intelligence tools, data-mining tools,
  • OLAP cubes, and analytic applications
  • Deliver data effectively both to highly centralized and profoundly distributed data warehouses using the same techniques
  • Tune the overall ETL process for optimum performance