The Two Roles of a Data Warehouse
Most people think of data warehouses as databases that solve reporting problems. However, it’s more useful to think of them as addressing two sets of problems: 1) reporting, or data distribution, problems and 2) data integration problems. In fact, Bill Inmon’s original definition of the data warehouse stated that it is “a subject oriented, integrated (implying data integration), time variant, non-volatile collection of data in support of management decisions (implying data distribution)”*. Note: content in parentheses added.
What you may not realize, however, is that the data structures best suited for data integration are not well suited for data distribution. Data integration works best with narrow, normalized tables. This format makes it easy to work with data at its atomic level, enabling small schema changes when required without causing major headaches. Just like building a model of something (and a database is, after all, a model of reality), you’ll get a more accurate representation of the original version if you start with small things (e.g. atoms) rather than begin with large things (e.g. wooden boards).
Normalized tables, however, don’t always work well for data distribution because they usually require a number of expensive joins to create reports (more on why data warehouses are best modeled in normalized schema in a future post). Instead, data distribution works best with non-traditional data structures.
One example of a non-traditional data structure is the wide, denormalized table, as found in star schemas. Other examples include multidimensional databases, like Essbase, and various in-memory databases like those offered by QlikView and Tableau. These aren’t necessarily great for data integration, but they are wonderful for distributing data quickly.
So why are non-traditional data structures so good at data distribution? Because they anticipate what reporting users want to see and pre-process the data to support those needs. Think about it – if you kept all your data in a normalized schema, you could logically recreate a star schema dimension table by placing a view over the relevant set of normalized tables. However, you would lose the performance benefit of doing that view’s joins in advance, as a true star schema dimension table does.
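The view-over-normalized-tables idea above can be sketched concretely. This is a minimal illustration using SQLite via Python; the table, column, and view names (`employee`, `department`, `dim_employee`) are hypothetical, not from any particular warehouse:

```python
import sqlite3

# Minimal sketch: logically recreating a star-schema dimension as a
# view over narrow, normalized tables (all names are hypothetical).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Normalized (integration-layer) tables: one narrow table per entity.
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE employee   (emp_id  INTEGER PRIMARY KEY, emp_name  TEXT,
                         dept_id INTEGER REFERENCES department(dept_id));

INSERT INTO department VALUES (1, 'Finance'), (2, 'Engineering');
INSERT INTO employee   VALUES (10, 'Ada', 2), (11, 'Grace', 2), (12, 'Edgar', 1);

-- A logical star dimension: same shape as a denormalized dimension
-- table, but the join runs at query time, every time -- which is
-- exactly the performance benefit a true star dimension pre-pays.
CREATE VIEW dim_employee AS
SELECT e.emp_id, e.emp_name, d.dept_name
FROM employee e
JOIN department d ON d.dept_id = e.dept_id;
""")

rows = cur.execute(
    "SELECT emp_name, dept_name FROM dim_employee ORDER BY emp_id"
).fetchall()
print(rows)  # -> [('Ada', 'Engineering'), ('Grace', 'Engineering'), ('Edgar', 'Finance')]
```

The view gives reporting users the wide, pre-joined shape they want while the data itself stays normalized – you trade away only the query-time cost of the joins.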
The Downside of Star Schema Data Warehouses
Now, some extremely bright people have come up with ways to replace normalized data warehouses with star schemas tied together via “conformed dimensions.” It sounds like a nice work-saver, but in actuality it creates architectures that don’t really support the broader range of functions a well-designed data warehouse can handle – like serving as a source for master data.
Consider the following:
- SUBJECTIVITY vs. OBJECTIVITY: Star schemas are inherently subjective (e.g. is “department” a dimension, an attribute of an employee, or both?). Normalized schemata, on the other hand, are less subjective. Since this data will eventually be used in a variety of ways, do you want the core – where all data is integrated – to be subject to the decisions of any particular developer?
- BRITTLENESS: Star schemas are inherently brittle. For example, if you integrate in a star, then when a source system adds a column you have to figure out how to handle all the historical records in the related dimension table. Normalized schemata don’t suffer from those same effects.
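The brittleness point is easy to demonstrate. Below is a sketch using SQLite via Python of a dimension table that carries history (in the style of a type-2 slowly changing dimension); the `dim_customer` table and its columns are hypothetical. The moment the source adds a `segment` column, every historical dimension row needs a decision – backfill it, leave it NULL, or invent an “unknown” value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A star-style dimension that already holds historical versions of
# each customer (type-2 style; all names are hypothetical).
cur.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  INTEGER,
    city         TEXT,
    valid_from   TEXT,
    valid_to     TEXT
);
INSERT INTO dim_customer VALUES
    (1, 100, 'Boston', '2020-01-01', '2021-06-30'),
    (2, 100, 'Denver', '2021-07-01', '9999-12-31');
""")

# The source system adds a 'segment' column. In a normalized schema this
# is a simple nullable column (or a new narrow table) going forward. In
# the dimension, every historical row now demands a backfill decision.
cur.execute("ALTER TABLE dim_customer ADD COLUMN segment TEXT")
undecided = cur.execute(
    "SELECT COUNT(*) FROM dim_customer WHERE segment IS NULL"
).fetchone()[0]
print(undecided)  # -> 2: both historical rows await a backfill decision
```

The normalized integration layer sidesteps this because history and the new attribute live in separate narrow structures; the star entangles them in one wide table.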
In the end, yes – you can shoehorn a normalized schema into the reporting role or shoehorn a star schema into the data warehouse role – but they really serve different purposes, and each is better suited to its own. As a result, our preferred architecture for big, complex reporting needs is a normalized data warehouse feeding out to denormalized – or alternative-technology – tools.
People (somewhat incorrectly) associate Bill Inmon (the father of data warehousing) with normalized data warehouses and Ralph Kimball (the father of the star schema) with star schema data warehouses. When building a BI system that brings together data from multiple sources we frequently recommend “getting to Inmon on the back of Kimball.” What does this mean?
1) Design, at a high level (i.e. not in detail), a normalized data warehouse (i.e. the Inmon approach). We call this data warehouse the “data integration layer.”
2) Then, to support your first users, design (in detail) the marts / stars they will need. We call these marts the “data distribution layer.” Simultaneously, create the detailed designs of the normalized data warehouse tables needed to populate those stars. This is important: don’t design the entire normalized DW in detail – design only the portion required to support the initial needs.
3) When the next subject area / star comes along, design it together with the part of the integration layer needed to support it.
I’d love to hear your thoughts. To share them, contact me or just respond to this post.