One of the many things that delights me about PowerPivot is the central role played by the Star Schema. Those of you reading with a data-warehousing background would shrug your shoulders and say: “So what, what else would you expect to find at the core of a BI tool?”.
Those from an Excel PivotTable background would ask: “What’s a Star Schema, why do we need one,what’s wrong with a the good old-fashioned single flattened table?”.
A quick Q&A is perhaps the best way for me to explain why star schema design is a powerful skill in a datasmith’s toolset.
First off, what’s a Star Schema?
A Star Schema (also know as the dimensional model) is a denormalised (flattened) data model used to simplify an operational (OLTP) data model to better accommodate reporting and what-if analysis.
At its simplest, it consists of a central fact table with links back to a “surrounding” set of dimensional tables, hence the star name. A variation is the snow-flake schema, where the dimensional tables are not fully denormalised (e.g. Product Category->Product->Fact instead of Product->Fact).
The role of the fact table (besides being the table that hosts most of the measure fields) is to create linkages between dimensions (such as Customer, Product, Date) usually based on an actual transactional event (e.g. Invoice) or a proposed event (such as a Budget or Forecasted Sale). In effect, simplifying the often complex work-flow-driving connections of a typical operational system by using a single many-to-many relationship (modern ERP/CRM systems’ data models consist of scores of configurable many-to-many relationships).
Many wrongly believe the star-schema was adopted for performance reasons and now that in-memory OLAP is becoming the norm it’s no longer necessary to use dimensional modelling techniques. In fact, in the early days of data-warehousing, RDBMs had great difficulty efficiently handling star-queries (and some such as MySQL and SQLite, still do).
The original primary purpose of the star schema was to simplify the SQL required to access reporting data; to make the model more approachable to non-technical users. Of course, even simple SQL was beyond the knowledge or interest of most end-users but a sizeable proportion were happy to do so (often helped by SQL “generators” such as MicroStratery or Business Objects). But even in situations where SQL-wielding civilians were not to be found, the simplicity of the dimensional models proved to be a valuable aid when establishing and developing the warehouse data requirements. PowerPivot requires no SQL knowledge to manipulate the dimensional model which brings the original concept full-circle but this time opening its possibilities to a much wider audience.
But surely, concentrating on the actual reports would be a more valuable requirements gathering exercise?
A so called “bottom-up approach” is often the best way to approach a reporting request particularly if the reports are simple one-off “traditional” reports. But for self-service BI, this needs to be combined with a top-down dimensional design. The idea is not to build out each and every report or indeed cube but to build a structure that’ll support likely queries. The process of building a star schema provides both a logical model and a physical implementation of that model against which potential queries can be tested. I’ve worked on several POCs destined for implementation in Essbase where the star-schema was built and potential cubes mocked up using Excel PivotTables that subsequently never went any further (except for the star-schema ETL process). The end-users derived sufficient value from the denormalised star-schema pivoted and reported in Excel.
In traditional ROLAP data-warehouses where the cubes were built directly against star-schemas, the pure logical approach to the data model often had to take a back-seat to the necessity of fine-tuning it to make response times (be that ETL or user-pivoting) acceptable. This is why I much preferred situations where the star acted as a logical model from which MOLAP cubes were built.
With PowerPivot, ROLAP has a new champion. The column-oriented high-compression in-memory architecture means that the compromises of the past are no longer necessary. The fact table reverts back to it primary role as a many-to-many connector. In a pure hypercube, measures are just another dimension (the approach that Palo takes), this is also now true of PowerPivot models; measures can be sourced from dimension tables and dimensions from fact tables as it logically should be, but without the performance hit of old.
But what’s the advantage of a star schema over a flattened table when using PowerPivot?
It is true that the same flattened table model as used to backend a PivotTable can be used within PowerPivot. But doing so would limit the potential of the DAX language to construct measures such as average sales spread over potential customers (rather than actual customers that would typically be represented on a flattened table). Also, by creating “conformed dimensions” (single cross-business views of Customer, Product etc.) and using such tables as dimensional sources for multiple fact tables, “virtual cubes” that combine values from multiple fact tables can be built.
If you’re new to dimensional modelling I’d recommend the books & articles of Ralph Kimbal as good starting point. You do have to be aware that some of the advice regarding efficiency trade-offs, surrogate keys etc. do not apply in a PowerPivot scenario (even though other performance issues still apply) but the logical design tips still apply.
Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.