Death of the Star Schema?

With the release of the next version of PowerPivot around the corner (mid-March, I think), I've been re-acquainting myself with its new features. Most of the current version's annoyances have been remedied (the lack of drill-through and hierarchy support, for example); and the additional enhancements to the DAX language (crossjoins, alternate relationships, etc.) make modelling most any problem possible (and generally easy).
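
One of those new features, alternate relationships, deserves a quick illustration. A minimal sketch, assuming a hypothetical FactSales table with both an OrderDateKey and a ShipDateKey pointing at a DimDate table (only one of the two relationships can be active; the other is invoked on demand with USERELATIONSHIP):

    Sales by Ship Date :=
        CALCULATE (
            SUM ( FactSales[SalesAmount] ),
            -- use the inactive ShipDateKey relationship for this calculation only
            USERELATIONSHIP ( FactSales[ShipDateKey], DimDate[DateKey] )
        )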

The more I come to know PowerPivot, the more I believe that modelled data warehouses' days are numbered. I didn't say data warehouses per se, rather those that attempt to centrally model end-user reporting structures (usually as star-schemas).

There will continue to be a need for centrally controlled data warehouses (or at least for simplified data views, and/or copies, of operational datasets, provided either by system vendors or by in-house IT) to bridge the raw-to-actionable data gap. But I suspect the emphasis will change from providing finished goods to providing semi-processed raw materials.

So, will the star-schema become redundant? No, as it's still a valid method of modelling a reporting requirement in order to make many queries simpler to phrase (this obviously applies to SQL, but also to DAX queries). But those who build them will be doing so closer to the problem at hand, and specific to that problem (I've discussed this before in http://blog.gobansaor.com/2011/01/08/slowly-changing-dimensions-time-to-stop-worrying/).

For many reports, the barely modified operational data model will be all that's required (for example, DAX doesn't require "fact" header/detail tables to be flattened to detail level, as would be the case with a classic star).
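
As a sketch of that point, assume hypothetical OrderHeader and OrderLine tables related on OrderID, with the amounts held at line level. A single detail-level measure is enough:

    Total Sales := SUM ( OrderLine[LineAmount] )

Dropping a header-level attribute such as OrderHeader[Customer] or OrderHeader[OrderDate] onto the pivot filters the OrderLine rows through the OrderID relationship; no ETL step is needed to copy the header columns down onto every detail row.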

“Good Enough” models will become the norm; classic “Everything You Ever Wanted to Know” centralised models a luxury for most (especially as such models tend to “age” very quickly).

If you're about to invest in or refurbish your data warehouse or your reporting data sub-systems, don't do so without first taking a serious look at PowerPivot. This is a game-changer, not just for full-stack Microsoft BI shops, but for any business that finds that its reporting datasets invariably end up in Excel.

If you need any help evaluating PowerPivot or modelling your reporting needs in PowerPivot, I'm for hire.

Update:

Just in case you think I'm a dimensional-model unbeliever, or likely to abandon my star-schema roots, read this…

Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.


12 responses to “Death of the Star Schema?”

  1. I think star schemas, if anything, will be on the rise once Excel folks discover the benefits they offer in cases like Sales vs. Actuals, as described at a high level in two of my recent posts.

    That said, I 100% agree with your stance that DWs are going to be much more about raw materials than finished product. I'm right there, preaching the same message. I'd just suggest that star schemas don't require a lot of carefully planned effort, they are very much compatible with the “raw materials” approach, and they provide real benefit to the Excel audience. Snowflake schemas, however… those might be REALLY dead 🙂

  2. @Rob, Yes, you're right, more people will see the benefit of star-like models, but the difference from the past will be their (the models') just-in-time, problem-specific nature. Too much money and goodwill were wasted in the past trying to second-guess the future needs of businesses. Give your datasmiths the data (cleansed and simplified if need be) and also make sure they understand its meaning and potential, and let 'em loose …

  3. From the user's (or business developer's) perspective, the concepts of the star schema are inherent in how they (we) think about data: for example, a subject area with things we know about (whether we call them attributes, fields, columns…). If anything, tabular makes this easier to model due to the underlying relationship functionality, so even if the data is not loaded as a star, it will be presented as one. So the schema will gain ground.

    For me, PowerPivot breaks ground for the end user because it removes reliance on the modelled data warehouse by providing the ability to combine large amounts of information from several sources (something for which IT, as the custodians of the data warehouse, are notoriously slow and expensive). This could potentially see a changing of the ‘information guard’, with the warehouse becoming the laggard of information needs rather than the leader.

  4. @Paul

    Yes, agree with both points.

    But there's also a significant number of users (and business developers) who understand (or are capable of understanding) OLTP schemas, and there are many who today report successfully against such schemas using SQL or SQL-generating tools such as Business Objects.

    DAX (via the underlying relationship logic) makes the production of sophisticated reports against such models much easier than is the case with SQL. So for these situations a hypercube view of the data can be generated without strict star-schema modelling.
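
    For example (a rough sketch only, assuming a hypothetical Customers -> Orders -> OrderLines chain, related on CustomerID and OrderID), a couple of measures defined once against the detail table:

        Revenue := SUM ( OrderLines[Amount] )
        Order Count := DISTINCTCOUNT ( OrderLines[OrderID] )

    can then be sliced by Customers[Region], Orders[Channel], or any other column along the relationship chain; the equivalent SQL has to re-state the joins and GROUP BY for every variation of the report.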

    But my (somewhat tongue-in-cheek) headline was not so much directed at the general use or benefits of dimensional modelling, but at its traditional use as the end-product of centralised modelled data warehouses.

    Tom

  5. Why do you actually require a star schema? Does it really have benefits in the longer run? It only adds too much cost and at times creates problems with wrong data. Moreover, achieving a near-real-time version of the data is also a huge problem. There are ways other than the star schema where we can save cost and get a bigger benefit in the longer run. Can anybody provide evidence that they have benefited from a star schema compared to other models in the longer run, while also saving a lot of cost?

  6. Well yes Bala, I, and my end users, have benefited from dimensional models in the past. Admittedly, if I'd had the power of DAX back then, many of those models would not have been built, saving a significant amount of money. But many, even a majority, would still have benefited from a dimensional approach. Mainly this was due to the nature of the source systems: complex ERP, demand planning & CRM schemas, consisting of long chains of configurable many-to-many relationships, which DAX (or civilian or generated SQL) would choke on. In such cases, complex ETL was already on the cards, so picking a dimensional model, rather than a CDM (logical corporate data model), made sense (from both a complexity and a cost-benefit point of view).

    Tom

  7. Again, the source system itself will have a lot of built-in reports. Moreover, everything depends on the design of the source system. Still, for reporting purposes, the design could be optimized rather than putting a lot of work into a star schema and maintaining it with a lot of investment. Is it really worth the deal?

    • If standard reports provided all the decision-support information that businesses required, we would never have had a BI industry (or tools like Excel/PowerPivot); they don't, and never will.

      But yes, everything does depend on the source system data schema; a bespoke, data-design-close-to-the-business-view dataset will generally be easier to ad-hoc report on, either via SQL or DAX. However, most businesses depend on packaged software for their core operations, and being packages they tend to be highly configurable (mainly through configuration tables holding many-to-many relationships); such schemas can be very difficult to phrase reports against. Hence the likes of SAP offering starter-pack star-schemas as part of their reporting tool set.

      Likewise, merging datasets from multiple internal/external systems (some may not even be relational, but API-provided or held in NoSQL datastores) can often require the development of a new data model; again, in such situations a dimensional model may suit best. And even if not, potentially costly ETL will still be part of the equation.

      When OLTP data models are understood inside out, and when staff have the necessary SQL or programming skills, and the time and the budgets, to quickly generate ad-hoc reports (and package them into “pretty artefacts”), then no extra data models (or BI tools) are required. In such situations, building a star-schema may indeed be largely pointless.

      But very few of the world's businesses are in that exact position. Tools like PowerPivot/DAX are not silver bullets; they are, however, still bullets, which when used in the way best suited to a particular situation (some ETL-free, some heavily supported by ETL processes), are a godsend for many.

      Tom

  8. Tom,
    I agree that people need to understand the source data model. When people understand the source data model, merging data models becomes a little easier and saves a lot of ETL work. But when a data mart is designed by people who have not understood the source data, then it is alarming; it is precisely because of this that a lot of data mart projects run into problems. If we reduce the ETL process, we save time, effort, cost and erroneous data. Moreover, post-migration verification of the data can be reduced and the customer can be confident about the data, and maintaining a single version of the truth becomes easier. In today's world, where data needs to be moved to the DWH environment very frequently, say every hour, this adds yet another bottleneck to the ETL process and the star schema. When the star schema data grows out of bounds, everything becomes a headache. There will always be a trade-off between the ETL process and the post-ETL process (like inserting/updating large chunks of data versus selecting with queries; we have to compromise between quick inserts/updates and quick queries). Maintenance of the DWH database is also a somewhat costly affair. So would it not be better to avoid these processes instead of doing something fancy?

    • Bala,

      “But when a data mart is designed by people who have not understood the source data, then it is alarming.” Yes indeed, but it is equally alarming when a “report” (be that PowerPivot or whatever) is generated by people who do not understand the data.

      As per my post, even when data mart designers know the source data, the restricted view exposed by their data marts is often “alarming” as well, either due to lack of business knowledge or to designing against a view of the business that is no longer relevant.

      And, even if the data mart designers get everything right on the night, the business, and the environment it operates in, continues to change; the models age and eventually fail.

      All have costs, ETL usually being the most significant, but bad business decisions made on the back of bad analysis or bad data can be much more expensive. Data prep, be that simply in the mind of the data analyst or codified in an ETL process, is the constant; which one will be needed depends on the circumstances.

      Star-schemas are simply a choice of model (a logical CDM, as per Inmon, is another option) generated as a result of an ETL process. For many current BI tools they're not optional: you must provide a star-schema (or construct an OLAP hypercube); but tools such as PowerPivot (and the likes of Business Objects in the past) do not require one. The ETL process is where the cost lies; if you can do without it you'll save up-front money, but only if your target data/business analysts are comfortable working with the raw source data.

      Tom

  9. The main problem comes when there are many changes and the data mart becomes inefficient as a result. After some time the data mart becomes stiff, and changes to it become costlier in terms of design and performance. Sometimes it becomes costlier than normal OLTP performance and eats up a lot of cost in hardware. If an OLTP-based DWH and a star-schema-based design run in parallel for a few releases, we will certainly find a lot of differences, if we monitor both with an open mind.

  10. Pingback: There’s Cling-ons on the Star-Schema bow, Jim. | Gobán Saor