Why use a dimensional data model for PowerPivot?

Why use a dimensional data model for PowerPivot? This is not about why one should, or should not, convert an OLTP model into a dimensional one; instead, this post looks at the why from another direction, i.e. morphing a “flat” table model, as used to feed traditional Excel Pivot Tables (and other pivoting platforms), into a more PowerPivot/DAX-friendly format.

Many of the applications of PowerPivot I come across are barely modified traditional Pivot Tables migrated to PowerPivot. I probably come across a lot of these due to my automated PowerPivot refresh code: having made the cross-over, it then dawns on migrants that the Excel object model doesn’t support PowerPivot automation (something that Excel 2013 thankfully addresses), so many come to me looking for help automating the process.

But why move such models to PowerPivot? The usual reasons are PowerPivot’s ability to handle very large datasets and to take advantage of all that DAX offers, not just for measures but also for ETL purposes.

Migrants from Pivot Table land quickly come to appreciate the in-built relationship functionality as a relief from the drudgery (and error potential) of VLOOKUP and INDEX/MATCH, but often simply use it to continue to produce single flat tables. Eventually, the visual benefit of having separate table “containers” for different types of data is noticed. But this separation of data into dimensions is not just a categorisation trick; it’s much more.

The Time Intelligence functions are particularly good at bringing DAX window-shoppers into the store. And it’s here that the first real need for a “dimension” raises its head. You can get away without a Date (aka Time) Dimension, and many did/do (mainly because in the early days of PowerPivot the “rules” weren’t very clearly spelled out). But, to make best use of these functions, you need to follow these rules:

  • Create a Date Dimension, see here for a definition and the DAX to create one.
  • Ensure any relationships between this table and others are based on Date columns (not Date + Time columns!). Or, if using a version later than V1, the option to mark the table as a date table is available; this allows non-date columns to be used as primary/foreign keys (as would be the norm in traditional data warehouses).
  • Never (well, hardly ever) use the Date column on the many side of the relationship (usually a “fact” table) as the date parameter for Time Intelligence functions; use the Date Dimension’s date column instead.
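A gap-free Date Dimension such as the first rule requires can be sketched in DAX itself, though note that CALENDAR and calculated tables need a later engine than the PowerPivot add-in discussed here (in V1/V2 the table would be built in Excel or SQL instead); the date range and column names below are illustrative assumptions:

```dax
-- Sketch of a no-gaps Date Dimension as a DAX calculated table.
-- One row per calendar date, whether or not any fact row exists for it.
DateDim =
ADDCOLUMNS (
    CALENDAR ( DATE ( 1996, 1, 1 ), DATE ( 1998, 12, 31 ) ),
    "Year", YEAR ( [Date] ),
    "MonthNo", MONTH ( [Date] ),
    "Month", FORMAT ( [Date], "MMM" )
)
```

The key property is simply that the [Date] column is unique and has no missing days; everything else is convenience attributes.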

But why?

The need for a no-gaps list of dates arises because in most models the Date dimension is a so-called sparse dimension, i.e. not every date will have a transaction recorded against it.

So why does that matter?

By using this guaranteed no-gaps list, the Time Intelligence functions (mainly “windowing” functions like Year To Date (YTD), Moving Annual Totals (MAT) etc.) can be sure to have a valid data point to work against no matter what other selections are in force.

So, for example, if a particular combination of selectors showed sales for Jan through May but no sales for the rest of the year, a Year To Date (YTD) measure would show the cumulative YTD figures up to May, but would also continue to show values for the remaining months of the year, as long as the date referenced is on a Date Dimension and not on a “fact” table. In SQL terms, the date dimension table is outer joined to the fact table, ensuring a date value (usually the max date value in the current selection) is always available even when there are no fact rows. If, however, the date on the fact table is used, those months with no sales (and hence no rows) would show YTD as blank.

Also, the Time Intelligence functions perform a hidden trick. When such a function realises that the date column fed to it is the unique key column of its table, it implicitly applies an ALL() against that table. This has the effect of removing any other date filters so that the “window” can be extended outside the current date context.
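To make the hidden trick visible, here is a sketch of roughly what DATESYTD over the Date Dimension’s key column expands to when written out by hand (using the DateDim[FullDate] and Orders[Freight] names from the example further down; the exact internal rewrite is an assumption, but the ALL() over the whole table is the point):

```dax
-- Roughly equivalent hand-written version of a DATESYTD measure.
-- ALL(DateDim) strips any existing date filters; the FILTER then
-- rebuilds the "start of year up to the current max date" window.
YTD_Explicit:=
CALCULATE (
    SUM ( Orders[Freight] ),
    ALL ( DateDim ),
    FILTER (
        ALL ( DateDim[FullDate] ),
        DateDim[FullDate] <= MAX ( DateDim[FullDate] )
            && YEAR ( DateDim[FullDate] ) = YEAR ( MAX ( DateDim[FullDate] ) )
    )
)
```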

The ability to use an ALL() filter against a set of known and not-yet-known attributes grouped by a logical “dimension” is also the primary reason for ensuring other dimensional groupings such as customer, product etc. are separated out into their own tables. It is of course possible to produce the same result on a flat table by specifying ALL() against each column that represents a particular dimension within the base table. The problem with this approach is that, as a model matures or is copied and modified, it is very common to add new attributes to a dimension; without the “ALL() against a table” trick, each measure using that dimension would need to be checked, and possibly modified, every time an attribute is added (or deleted).
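The maintenance difference is easiest to see side by side (the Customer, Sales and FlatSales table and column names below are assumptions for illustration):

```dax
-- Dimension-table version: one ALL covers current AND future attributes.
AllCustomersSales:=
CALCULATE ( SUM ( Sales[Amount] ), ALL ( Customer ) )

-- Flat-table version: every customer attribute must be listed explicitly,
-- and the list must be revisited whenever an attribute is added or removed.
AllCustomersSales_Flat:=
CALCULATE (
    SUM ( FlatSales[Amount] ),
    ALL ( FlatSales[CustomerName] ),
    ALL ( FlatSales[CustomerCity] ),
    ALL ( FlatSales[CustomerSegment] )
)
```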

Most major dimensions are sparse like the Date Dimension; their members will not always be represented within all possible slices of the dataset. Being able to create measures that, for example, calculate the ratio of customers that purchased this month to the total number of customers is much easier with separate dimensional tables.
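A sketch of such a ratio measure, assuming a Customer dimension related to a Sales fact table on CustomerKey (names are illustrative, not from the Northwind example below):

```dax
-- Customers who bought in the current selection, over all known customers.
-- The denominator only works because Customer is its own table:
-- ALL(Customer) reaches members with no sales in the current slice.
ActiveCustomerRatio:=
DIVIDE (
    DISTINCTCOUNT ( Sales[CustomerKey] ),
    CALCULATE ( COUNTROWS ( Customer ), ALL ( Customer ) )
)
```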

The screen capture above is taken from this example of a date dimension in action, based on the Orders table from the Northwind OData database. The three measures are:

YTD:=CALCULATE(SUM(Orders[Freight]),DATESYTD(DateDim[FullDate]))

YTD_BasedOnFactDate:=CALCULATE(SUM(Orders[Freight]),ALL(DateDim),DATESYTD(Orders[OrderDate]))

YTD_BadVersion:=CALCULATE(SUM(Orders[Freight]),DATESYTD(Orders[OrderDate]))

YTD, being the goody-goody following all the rules, works as expected.

YTD_BasedOnFactDate tries to be smart: it uses the date on the Orders table and explicitly applies ALL(DateDim) to expand its time window. It works, but not when the month contains no transactional data.

YTD_BadVersion ignores all the rules and doesn’t work.


10 responses to “Why use a dimensional data model for PowerPivot?”

  1. Pingback: There’s Cling-ons on the Star-Schema bow, Jim. | Gobán Saor

  2. Pingback: Playing DAX Percentiles on the mean (or is that median) Streets of Ireland | Gobán Saor

  3. I’ve read in several places that using surrogate keys in PowerPivot is not recommended, but nowhere have I found a reason why this is discouraged. What happens when I keep on using them, what are the drawbacks? Should I get rid of them altogether? Or just for the date dimension? Or can I continue using them?

    Thanks

  4. I think “not necessary” would be a better reason.

    Surrogate keys are really a large-scale data warehousing construct, unnecessary, and can be error-prone, if say loading data straight’ish from OLTP. The sort of thing IT folks rather than datasmiths obsess over. Having said that, if you’re sourcing data from a trad. DW, the work’s already done, so continue using them. Oh, and if using techniques such as Type II SCDs, then surrogates are here to stay.

    In the previous version of PP, date dimensions had to use natural (date minus time) keys to get best usage of the time-intel funcs. This is no longer necessary; simply mark the time dim table as a “date table” and continue to use surrogates.

    Tom

    • Excellent advice, thanks! I am seeing that much of the traditional dimensional modelling advice can be skipped (in dimensions: surrogate key, null row, type column to distinguish the null row from the rest of the rows), and still a basic star schema lends itself to use with PowerPivot.

  5. Pingback: M’s back – Create a PowerPivot Time Dimension using Excel’s Data Explorer | Gobán Saor

  6. patrick mccarthy

    So, what if I have multiple date fields in my fact table? PP only supports one active relationship at a time (I can DAX against any of the relationships, but end users cannot see the different Date Dims in the UI)..

    do you then recommend duplicating and triplicating the DateDim?

    • No.

      Multiple date dims exposed to end users tend to cause confusion, usually a result of the exposed model being too generic. Best pick a single date driver and convert the rest to measures. So, in a classic order-to-cash pipeline there are at least three date dims: order, delivery, invoice. Match the driving date to the audience or to the current status of the sale. Create “days since order”, “days since due-delivery date” etc. measures to cover the missing dims.
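      A “days since order” measure of the kind suggested might look like the following sketch (column names assumed; subtraction rather than DATEDIFF keeps it valid in PowerPivot-era DAX):

```dax
-- Days elapsed since the latest order date in the current selection.
-- Covers the "order date" angle without a second date dimension.
DaysSinceOrder:=
INT ( TODAY () - MAX ( Orders[OrderDate] ) )
```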

      Tom

  7. patrick mccarthy

    Tom, thank you for the quick answer. An interesting concept. Think support ticket data: create date, close date… User will want to pivot and group by create date sometimes (ticket counts by create date, user, type etc) AND by close date (ticket counts by date closed, user, type etc)…

    For prototyping we have imported a DateDim two times, renamed it to CreateDateDim and CloseDateDim then built the relationships… works but will be more maintenance..

    It would be nice if PP could leverage multiple relationships like SSAS does for shared dimensions…

  8. Patrick,

    In V1 of the add-in you’ll need to use separate date dims for this, but since V2 the USERELATIONSHIP function http://msdn.microsoft.com/en-us/library/hh230952.aspx introduces a concept similar to roles.
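    For the support-ticket case above, a sketch of the role-playing pattern (assuming an active relationship from DateDim[Date] to Tickets[CreateDate] and an inactive one to Tickets[CloseDate]; the table and column names are illustrative):

```dax
-- Uses the active relationship (create date) by default.
TicketsCreated:=COUNTROWS ( Tickets )

-- Switches to the inactive close-date relationship for this measure only,
-- so one DateDim serves both roles without being imported twice.
TicketsClosed:=
CALCULATE (
    COUNTROWS ( Tickets ),
    USERELATIONSHIP ( DateDim[Date], Tickets[CloseDate] )
)
```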

    The trick is to try and solve the problem with measures, not dims; in other words, models that, say, expose two customer dims or two date dims are likely to be confusing to your end users. Sometimes you can’t help that, but usually it means the model is too generic (due to the developer being too DRY, or the cost of specific views being too expensive, see http://blog.gobansaor.com/2011/01/08/slowly-changing-dimensions-time-to-stop-worrying/).

    The ideas in this http://mathieu.fenniak.net/stop-designing-fragile-web-apis/ on API design could also be applied to end-user-facing cube design.

    Tom