Star Schemas – to boldly go where no Excel spreadsheet has gone before


One of the many things that delights me about PowerPivot is the central role played by the Star Schema. Those of you reading with a data-warehousing background would shrug your shoulders and say: “So what? What else would you expect to find at the core of a BI tool?”.

Those from an Excel PivotTable background would ask: “What’s a Star Schema, why do we need one, and what’s wrong with the good old-fashioned single flattened table?”.

Those from a classic MOLAP background (Essbase, TM1, Palo) might also ask: “Why do we need this extra layer? Load the cube directly from the operational data model and get on with it!”.

A quick Q&A is perhaps the best way for me to explain why star schema design is a powerful skill in a datasmith’s toolset.

First off, what’s a Star Schema?

A Star Schema (also known as the dimensional model) is a denormalised (flattened) data model used to simplify an operational (OLTP) data model to better accommodate reporting and what-if analysis.

At its simplest, it consists of a central fact table with links back to a “surrounding” set of dimensional tables, hence the “star” name. A variation is the snow-flake schema, where the dimensional tables are not fully denormalised (e.g. Product Category->Product->Fact instead of Product->Fact).

The role of the fact table (besides being the table that hosts most of the measure fields) is to create linkages between dimensions (such as Customer, Product, Date), usually based on an actual transactional event (e.g. an Invoice) or a proposed event (such as a Budget or Forecasted Sale). In effect, it simplifies the often complex workflow-driven connections of a typical operational system into a single many-to-many relationship (modern ERP/CRM systems’ data models consist of scores of configurable many-to-many relationships).

Many wrongly believe the star-schema was adopted for performance reasons, and that now that in-memory OLAP is becoming the norm it’s no longer necessary to use dimensional modelling techniques. In fact, in the early days of data-warehousing, RDBMSs had great difficulty efficiently handling star queries (and some, such as MySQL and SQLite, still do).

The original primary purpose of the star schema was to simplify the SQL required to access reporting data; to make the model more approachable to non-technical users. Of course, even simple SQL was beyond the knowledge or interest of most end-users, but a sizeable proportion were happy to write it (often helped by SQL “generators” such as MicroStrategy or Business Objects). But even in situations where SQL-wielding civilians were not to be found, the simplicity of the dimensional model proved to be a valuable aid when establishing and developing the warehouse data requirements. PowerPivot requires no SQL knowledge to manipulate the dimensional model, which brings the original concept full circle, this time opening its possibilities to a much wider audience.

But surely, concentrating on the actual reports would be a more valuable requirements gathering exercise?

A so-called “bottom-up” approach is often the best way to approach a reporting request, particularly if the reports are simple one-off “traditional” reports. But for self-service BI, this needs to be combined with a top-down dimensional design. The idea is not to build out each and every report, or indeed cube, but to build a structure that’ll support likely queries. The process of building a star schema provides both a logical model and a physical implementation of that model against which potential queries can be tested. I’ve worked on several POCs destined for implementation in Essbase where the star schema was built and potential cubes were mocked up using Excel PivotTables, yet the projects subsequently never went any further (except for the star-schema ETL process): the end-users derived sufficient value from the denormalised star schema pivoted and reported in Excel.

In traditional ROLAP data-warehouses where the cubes were built directly against star-schemas, the pure logical approach to the data model often had to take a back-seat to the necessity of fine-tuning it to make response times (be that ETL or user-pivoting) acceptable. This is why I much preferred situations where the star acted as a logical model from which MOLAP cubes were built.

With PowerPivot, ROLAP has a new champion. The column-oriented, high-compression, in-memory architecture means that the compromises of the past are no longer necessary. The fact table reverts to its primary role as a many-to-many connector. In a pure hypercube, measures are just another dimension (the approach that Palo takes); this is now also true of PowerPivot models: measures can be sourced from dimension tables and dimensions from fact tables, as they logically should be, but without the performance hit of old.
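To make that a little more concrete, here is a minimal DAX sketch (the table and column names, Customer, Sales and Sales[OrderNumber], are made up for illustration, and the syntax is current-day DAX):

    -- A measure sourced purely from a dimension table: no fact columns involved.
    Customer Count := COUNTROWS ( Customer )

    -- A "dimension" sourced from the fact table: a degenerate attribute such as the
    -- order number can be placed on rows or slicers, and also counted as a measure.
    Order Count := DISTINCTCOUNT ( Sales[OrderNumber] )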

But what’s the advantage of a star schema over a flattened table when using PowerPivot?

It is true that the same flattened-table model as used to back a PivotTable can be used within PowerPivot. But doing so would limit the potential of the DAX language to construct measures such as average sales spread over potential customers (rather than the actual customers, which are all that would typically be represented on a flattened table). Also, by creating “conformed dimensions” (single cross-business views of Customer, Product etc.) and using such tables as dimensional sources for multiple fact tables, “virtual cubes” that combine values from multiple fact tables can be built.
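As a rough illustration of that first point (again, the names Sales, Customer, Sales[SalesAmount] and Sales[CustomerKey] are mine, not from any particular model):

    -- Average over only the customers that actually appear in the fact rows,
    -- which is effectively all a single flattened table can see:
    Avg Sales per Active Customer :=
        SUM ( Sales[SalesAmount] ) / COUNTROWS ( VALUES ( Sales[CustomerKey] ) )

    -- Average spread over every customer in the Customer dimension (within the
    -- current selection), including those with no sales at all; only possible
    -- because Customer lives in its own table:
    Avg Sales per Potential Customer :=
        SUM ( Sales[SalesAmount] ) / COUNTROWS ( Customer )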

If you’re new to dimensional modelling I’d recommend the books & articles of Ralph Kimball as a good starting point. You do have to be aware that some of the advice regarding efficiency trade-offs, surrogate keys etc. does not apply in a PowerPivot scenario (even though other performance issues still apply), but the logical design tips remain valid.

Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.

 



14 responses to “Star Schemas – to boldly go where no Excel spreadsheet has gone before”

  1. Pingback: Spreadsheet Add-Ins: Star Schemas in PowerPivot – Spreadsheet Day

  2. Pingback: Basic Data-Warehouse using Excel « Excel Tips Monster

  3. Pingback: Power (Pivot) to the People « Gobán Saor

  4. Pingback: PowerPivot – Show Detail not allowed! « Gobán Saor

  5. Pingback: Micro ETL in the PowerPivot age « Gobán Saor

  6. Another HUGE benefit of star schemas over single flat tables in PowerPivot: compression. Compression is MUCH better in PowerPivot as you reduce the number of columns in your large tables. Storing all those extra attributes in your dimension tables, and only having a single (preferably integer) linkage column in the fact table can lead to dramatically smaller files.

    Also, it should be noted that PowerPivot query performance is often (but not always) better as your compression rate improves. Smaller files also tend to result in faster queries. Star schemas – don’t leave home without ’em 🙂

  7. Rob,

    Yes, this does appear to be the case, which I’ll admit surprised me, as I would have thought the columnar nature of the Vertipaq engine would have compensated for the resulting wide rows of a flattened fact table. I think the main trick is, if you do denormalise something, make sure the resulting additional attributes have the lowest cardinality possible, i.e. don’t include high-cardinality columns such as Order No, Address Line 1 etc.

    I’ve not yet fully got my head around Vertipaq; it would be nice if MS published a bit more technical detail about it. In the meantime we must depend on studies such as this http://sqlblog.com/blogs/marco_russo/archive/2010/02/04/table-denormalization-study-in-powerpivot.aspx and the experiences of people like yourself in the field.

    It also appears that a star schema design has processing advantages as well; obviously those who wrote and optimised the DAX/MDX-to-Vertipaq queries had this data model as their primary use case.

    Tom

  8. Pingback: Excel in Data | Gobán Saor

  9. Pingback: Death of the Star Schema? | Gobán Saor

  10. Pingback: There’s Cling-ons on the Star-Schema bow, Jim. | Gobán Saor


  12. I am struggling with the usability of PowerPivot when used with a star schema. I am quite new to PowerPivot, but one of the first things that struck me is that only one relationship between two tables can be active at a time. How does this work at all? I have half a dozen relationships to my date table, and a few double-relationships to some other dimension tables as well.
    I have found out about USERELATIONSHIP, but my aim is to create a simple PowerPivot model out of a simple star schema, where the end user does not need to do much DAX. I chose to add a simplified data warehouse in between the operational database and PowerPivot, but I am not sure how to lay that out, now that I have discovered the limitation of multiple relationships in PowerPivot.

    Thankful for any hints or wisdom,
    Toby

    • Toby,

      Classic dimensional models (as developed for the DWs of the 90s & early 00s) didn’t support multiple relationships either, usually requiring a re-phrasing of the model to get around it (example: changing the value of the date key as the order-to-cash process continued from order-date to due-date to delivered-date to invoiced-date, with measures providing counts of days from order-to-deliver etc.). I come from that tradition, so I don’t find the restriction too much of a problem.

      In any case, using DAX (and yes, I know you wish to reduce the amount of DAX users are exposed to, but if you do, you must apply strict old-fashioned star-schema design) it is possible to model against multiple relationships (and many-to-manys) using “calculated relationships”. The fixed relationships just make things easy, but they do not stop on-the-fly relationships being conjured up at run-time (a bit like SQL, in fact); a short sketch follows this reply.

      Tom
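
      A minimal sketch of both approaches mentioned above, USERELATIONSHIP and an on-the-fly “calculated relationship” (the table and column names, Sales, ‘Date’, OrderDateKey, ShipDateKey and DateKey, are purely illustrative):

      -- 1. Where an inactive relationship Sales[ShipDateKey] -> 'Date'[DateKey] has been
      --    defined alongside the active order-date one, a measure can switch to it just
      --    for its own evaluation:
      Shipped Amount :=
      CALCULATE (
          SUM ( Sales[SalesAmount] ),
          USERELATIONSHIP ( Sales[ShipDateKey], 'Date'[DateKey] )
      )

      -- 2. Where no relationship on the ship date exists at all, a "calculated
      --    relationship" can be conjured up on the fly by keeping only the fact rows
      --    whose ship date falls within the dates currently selected:
      Shipped Amount Virtual :=
      CALCULATE (
          SUM ( Sales[SalesAmount] ),
          FILTER (
              Sales,
              CONTAINS ( VALUES ( 'Date'[DateKey] ), 'Date'[DateKey], Sales[ShipDateKey] )
          )
      )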

  13. Thanks Tom for this reply. I myself only entered the BI world at the time when the first complete open-source BI solutions (the likes of Pentaho) started to appear. I left the BI world again two years or so later, and now I am re-entering it after I discovered PowerPivot and its huge simplification over what I had used in my early BI time.

    Anyway, I have thought things over and in my current project, I (and hopefully my users) can very well live with importing dimensions multiple times where necessary, as long as I name things intelligently. I will also look at calculated relationships once I am out of the prototype phase.