Category Archives: BI

Death of the Star Schema?

With the release of the next version of PowerPivot around the corner (mid March I think), I’ve been re-acquainting myself with its new features. Most of the current version’s annoyances have been remedied (no drill-through and no hierarchy support, for example); and the additional enhancements to the DAX language (crossjoins, alternate relationships, etc.) make modelling most any problem possible (and generally easy).

The more I come to know PowerPivot, the more I believe that modelled data warehouses’ days are numbered. I didn’t say data warehouses per se, rather those that attempt to centrally model end user reporting structures (usually as star-schemas).

There will continue to be a need for centrally controlled data warehouses (or at least simplified data views (and/or copies) of operational datasets, either provided by system vendors or by in-house IT) to bridge the raw-to-actionable data gap. But I suspect the emphasis will change from providing finished goods to providing semi-processed raw materials.

So, will the star-schema become redundant? No, as it’s still a valid method of modelling a reporting requirement in order to make many queries simpler to phrase (this obviously applies to SQL, but also to DAX queries). But, those who build them will be doing so closer to the problem at hand, and specific to that problem (I’ve discussed this before in http://blog.gobansaor.com/2011/01/08/slowly-changing-dimensions-time-to-stop-worrying/).

For many reports the barely modified operational data model will be all that’s required (for example, DAX doesn’t require “fact” header/detail tables to be flattened to detail level, as would be the case with a classic star).

“Good Enough” models will become the norm; classic “Everything You Ever Wanted to Know” centralised models a luxury for most (especially as such models tend to “age” very quickly).

If you’re about to invest in or refurbish your data warehouse or your reporting data sub-systems, don’t do so without first taking a serious look at PowerPivot. This is a game-changer, not just for full-stack Microsoft BI shops, but for any business that finds that their reporting datasets invariably end up in Excel.

If you need any help evaluating PowerPivot or modelling your reporting needs in PowerPivot, I’m for hire.

Update:

Just in case you think I’m a dimensional-model unbeliever, or likely to abandon my star-schema roots, read this…

Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.


HAMMER – a new Excel ETL tool for the PowerPivot age …

So, why am I developing this new datasmithing tool when I already have microETL to do the same task?

To be honest, when I started, my aim was to simply port a lot of my useful framework code to .NET as I’m doing more and more work in pure .NET these days.

It’s a lot easier to do so with a project driving the work rather than the drudgery of line by line conversions simply for the sake of conversion.

But over the last few months of investigating the process of moving some of my more useful code over to .NET (discovering C#-SQLite as a near drop-in replacement for SQLite; realising that IronPython 2.7 marks the coming-of-age of Python under .NET; discovering the ease of multi-threaded programming under .NET 4.0), I began to see the potential of building a tool more suited to this new emerging world of technological plenty, of 64bit, multi-core, high-RAM machines playing host to ground-breaking tools such as PowerPivot, than to the 32bit, single-threaded world that microETL was born into.

I wanted to bring microETL back to its roots as an in-cell function. To make it sit more easily with the spreadsheet functional style of programming, a style of programming that has been more successful at attracting “civilians” to programming than any other method. To make the tool more in tune with the Excel-way of doing things.

At the same time, I also wanted the tool to be “detachable” from Excel, so that it could perform ETL functions server-side without the need for Excel. Ideally capable of being run from the command-line from a single no-install-required executable.

And so, the Datasmith’s HAMMER was born.

So where does this fit in with PowerPivot? When I design something like HAMMER I normally craft it around a series of “use cases”. Below is one of my PowerPivot & HAMMER use cases:

The Data Quality Team

The sales & marketing department of a wholesaling company have discovered PowerPivot and are excited about using it to help manage an upcoming merger with a recently purchased competitor, also in the wholesaling business. There’s a large overlap between customers and products supplied, and the sales structures, pricing and KPIs used by both companies are very different. The transition phase from two separate companies to a single operating unit will need detailed planning and constant monitoring. Oh, and there’s also the problem of the brand new ERP system that’ll replace the existing systems currently in use.

The group can see how PowerPivot models will help with this task but are worried about sourcing and managing the data. They decide to appoint a small data quality team, with a mixture of IT and business experience, which will be responsible for mapping old to old and old to new, and for managing the data deltas (business still goes on: new customers, new products, etc.).

Most of this work revolves around “master data”. Transactional data may be much larger in volume, but get the master data right and the transactional data will be easy to handle, as long as the processing capacity to handle the volumes is available (which, thanks to PowerPivot and really fast high-RAM PCs, it is).

In golf, there’s a saying: “You drive for show, but you putt for dough”. Likewise in data: a lot of emphasis is put on the big-data transactional datastores, but the real analytical benefit comes from the less sexy master data, aka the dimensions. As a result the most important output from the data quality team will be a set of conformed dimensions.

Each week (and at month-end) the team will get extracts of the master and transactional data in various formats from both legacy systems and will also get the WIP master datasets from the team working on the new ERP system. From these they’ll construct point-in-time conformed dimensions, combining old with old and old with new; and will rebuild transactional feeds with new conformed keys alongside existing keys. These datasets are then sent to the various analysis teams to enable them to build Excel and PowerPivot models that will hopefully all “sing from the same hymn sheet”.

And how will this be accomplished? With the datasmith’s HAMMER of course (well, it is my day-dream use-case!). Both in-Excel and command-line HAMMER “scripts” will be used to wrangle the datasets, and the complete sets packaged as SQLite database files.
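
To make this more concrete, here’s a minimal sketch in plain Python with sqlite3 (deliberately not HAMMER’s own command syntax, and with made-up table and column names) of the kind of key-mapping work such a script would do: build a cross-reference of legacy customer codes, then re-key a transactional feed so it carries the conformed key alongside the existing one.

    # A sketch only: map two sets of legacy customer codes onto a single
    # conformed key, then re-key a transactional feed so it carries the
    # conformed key alongside the existing one. All names are hypothetical.
    import sqlite3

    con = sqlite3.connect("conformed_week.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS customer_xref (
        conformed_key INTEGER PRIMARY KEY,
        legacy_a_code TEXT,      -- code in company A's legacy system
        legacy_b_code TEXT,      -- code in company B's legacy system
        customer_name TEXT
    );
    CREATE TABLE IF NOT EXISTS sales_a (  -- weekly extract from company A
        customer_code TEXT,
        amount REAL
    );
    """)

    # In practice these rows would come from the weekly extract files.
    con.executemany(
        "INSERT INTO customer_xref (legacy_a_code, legacy_b_code, customer_name) VALUES (?, ?, ?)",
        [("A001", "B777", "Acme Wholesale"), ("A002", "B812", "Brown Bros")],
    )
    con.executemany(
        "INSERT INTO sales_a (customer_code, amount) VALUES (?, ?)",
        [("A001", 1250.0), ("A002", 430.0)],
    )

    # The re-keyed feed: legacy key kept, conformed key added alongside it.
    con.execute("""
    CREATE TABLE IF NOT EXISTS sales_conformed AS
    SELECT s.customer_code, s.amount, x.conformed_key
    FROM   sales_a s
    LEFT JOIN customer_xref x ON x.legacy_a_code = s.customer_code
    """)
    con.commit()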

When the analysis groups receive these files, they too will use HAMMER to extract the datasets they require (either straight to CSV for loading into PowerPivot, or into Excel for some further transformation work prior to modelling). To handle slowly-changing dimension scenarios, many of the teams will squirrel away each week’s data in order to be able to model yet-unknown-but-very-likely changes in organisational structures.
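
Again purely as an illustration (plain Python rather than HAMMER itself, and assuming the hypothetical sales_conformed table from the sketch above), dumping a dataset from the weekly SQLite package straight to a CSV ready for PowerPivot is only a few lines:

    # Sketch: export one table from the weekly SQLite package to CSV,
    # ready to be loaded into PowerPivot. Table and column names are hypothetical.
    import csv
    import sqlite3

    con = sqlite3.connect("conformed_week.db")
    cur = con.execute("SELECT customer_code, conformed_key, amount FROM sales_conformed")

    with open("sales_conformed.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row from column names
        writer.writerows(cur)                                  # remaining rows straight from the cursor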

Although this is a use-case, it’s based on reality, a reality that would have been so much easier to manage if I, and the teams I worked with, had access to technology such as HAMMER and PowerPivot.

To download the latest version of the code, go to this page on my website.

Follow the HAMMER tag on this blog for information on commands and examples (best start with the oldest and work forward …)

Steam Powered PowerPivot

In previous posts I described various options for enabling PowerPivot models (or indeed any Excel model) as the hub in a hub’n’spoke application. This thin’n’core approach is the norm for web apps, as it was for the client-server applications that preceded them.

But as we know, Excel “applications” rarely follow this pattern.

They tend to follow a more peer-to-peer topology, often with each player being an “equal”; collaboration being the primary aim rather than service provision.

But even in such scenarios there usually is a hub at the centre providing the communication linkages. Most likely the company’s email system or a “shared drive” of some sort. In fact, on many corporate teams or projects, you know you’ve been accepted into the fold when the project manager says the magic words “I’ll get you access to the shared drive”.

But sometimes, hub-like service provision or a single controller role is required and in such situations one of the spokes takes over that responsibility. For example, in a budgeting scenario one spoke may be the source of historical stats, in effect, performing a service role.

If classic client-server is analogous to the driven wheels of a car, with a powered hub driving the client spokes, then many Excel systems could be compared to a steam engine’s driven wheels; the hub provides a centre to rotate about, but it’s a particular spoke that does the driving, due to its linkage to the piston arm.

Such systems built around the sending back and forth of workbooks or CSV files via email or shared drives are common and most of us have seen them in action. Most tend to be manual, in that they require user action on both client and server side, but with a bit of VBA magic one or both can be automated. Options for doing so on the server side are:

  • A VBA loop that sleeps, checks for changes, processes any changes, and sleeps again … (the same sleep-and-check pattern, run outside Excel, is sketched in Python after this list).
  • Trigger a microETL.SQLScript using a COM automation call from an external program like that used in this scenario.
  • Trigger a script using an RTD Server that checks for new activity.
  • Use a “Javascript” timer as per Excel Hero’s neat trick.
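
For what it’s worth, the first option’s sleep-and-check pattern looks much the same when run outside Excel; here’s a rough Python sketch (the shared-folder path and the processing routine are stand-ins, not part of any of my tools):

    # A sketch of the sleep-and-check pattern run outside Excel: poll a shared
    # folder for newly arrived files and hand each one to a processing routine.
    # The folder path and the process() body are stand-ins only.
    import os
    import time

    SHARED_FOLDER = r"\\server\share\inbox"
    POLL_SECONDS = 60

    def process(path):
        print("processing", path)          # stand-in for the real ETL step

    seen = set()
    while True:
        try:
            current = set(os.listdir(SHARED_FOLDER))
        except OSError:
            current = seen                  # folder unreachable; try again later
        for name in sorted(current - seen):
            process(os.path.join(SHARED_FOLDER, name))
        seen = current
        time.sleep(POLL_SECONDS)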

RTD servers are rarely encountered in the wild, as they only appeared on the scene in Excel XP and were not the easiest things to code or understand; but new tools such as ExcelDNA now make coding RTDs much more approachable and useful.

Daniel Ferry’s JSTimer idea is a simpler VBA alternative to an RTD and is non-blocking, unlike a pure VBA sleep-check-sleep loop. It also has the advantage that no add-in is required, so the client-side could be a single-file macro-enabled workbook.

Using a shared folder or email as the conduit for passing data back and forth and using Excel files or CSVs as the data artefacts may be the obvious choices but other “more modern” alternatives exist.

For example, for situations where a shared folder isn’t an option, why not use something like Amazon’s S3? S3 is secure (much more secure than external email), reliable, simple to use and extremely cheap: a modern-day alternative to an FTP server, ideal for big corporates, SMEs and sole-traders.

Orchestrating the processing of files on S3 is very simple using tools such as Python’s Boto (another good reason for having Python built into microETL). For the client-side, pure VBA could be used; see my xlAWS code.
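
To give a feel for how little code is involved (classic Boto-style S3 calls; the bucket name and file paths below are made up), pushing a workbook up to S3 and pulling down whatever has arrived might look something like this:

    # A sketch using classic boto S3 calls; bucket name and paths are made up.
    import boto
    from boto.s3.key import Key

    conn = boto.connect_s3()                  # credentials from env vars or boto config
    bucket = conn.get_bucket("acme-data-exchange")

    # Push a workbook up for the other side to pick up.
    k = Key(bucket)
    k.key = "budget/outbox/stats.xlsx"
    k.set_contents_from_filename("stats.xlsx")

    # Pull down anything that has been dropped into the inbox prefix.
    for key in bucket.list(prefix="budget/inbox/"):
        if not key.name.endswith("/"):        # skip folder placeholders
            key.get_contents_to_filename(key.name.split("/")[-1])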

Likewise, using workbooks as the medium of data exchange is usually the first option, with CSV files a close second. But since Excel 2003 Professional, Excel has had excellent support for the exchange of XML files via XML Maps. Although not useful for highly complex schemas or very large datasets (both of which are anyway the work of the devil 😉 ), XML Maps are really good at handling relatively simple dialog and configuration inter-change requirements.

The process of importing/exporting data via XML can be easily automated using basic VBA but can also be safely left under manual control. XML files can also be generated and read by non-Excel components.
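
As an illustration, a non-Excel component could generate such a file with a few lines of Python; the element names below are invented and would have to match whatever schema the workbook’s XML Map was built against:

    # Sketch: write a simple XML file for an Excel XML Map to import.
    # Element names are invented; they must match the XML Map's schema.
    import xml.etree.ElementTree as ET

    config = ET.Element("BudgetConfig")
    for dept, cap in [("Sales", 120000), ("Marketing", 45000)]:
        row = ET.SubElement(config, "Department")
        ET.SubElement(row, "Name").text = dept
        ET.SubElement(row, "Cap").text = str(cap)

    ET.ElementTree(config).write("budget_config.xml", encoding="utf-8", xml_declaration=True)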

So when you’re confronted with a PowerPivot model that is not suitable for direct sharing, due to:

  • technical constraints e.g. non-2010 clients
  • or business reasons e.g. detail data behind model, or the model itself, needing to be kept secret
  • or model complexity reasons i.e. too complex for a single PowerPivot model …

… investing in a top-of-the-range SharePoint farm is not your only option. By utilising some of the enormous power that comes as standard with Excel (VBA, XML Maps, RTDs, etc.), further empowered by an Excel-focused ETL tool such as microETL, a lot can be achieved on a relatively small budget.

If you need help designing or configuring such a scenario, or need help in understanding where PowerPivot can fit in to your decision making and information distribution requirements, contact me. My job in life is to provide that help …

Download latest version of microETL from http://www.gobansaor.com/microetl

PowerPivot Nested Set Measures

In my Handling Flat, Parent-Child and Nested Set Hierarchies and PowerPivot Powered Budgets & Targets posts I discussed my use of Nested Sets as a means of modelling unbalanced and ragged hierarchies. My first instinct when confronted with a non-trivial hierarchy, flattened or not, is to reach for my trusty Nested Set builders. However, it’s not just as an ETL re-modelling method that a datasmith is likely to encounter nested sets; they also exist as variations-on-the-theme in production system data models. One of the most common uses is in the modelling of systems’ ACLs (Access Control Lists). This use of range queries (often utilising “positive ranges” MINUS “negative ranges”, plus or minus “ragged lists”) can be very complex and would prove difficult to model using the flattened cross-join options available in PowerPivot (point solutions that “hard code” a fixed set of relationships are possible, but a generic “handle all comers” solution would be very difficult).
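
For anyone unfamiliar with the mechanics, here’s a minimal sqlite3 sketch (the table and its data are purely illustrative) of the range queries involved, including the “positive range MINUS negative range” composition mentioned above:

    # Sketch of nested-set range queries, including an ACL-style
    # "positive range MINUS negative range" composition.
    # The org table and its data are illustrative only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE org (node TEXT, lft INTEGER, rgt INTEGER);
    INSERT INTO org VALUES
      ('World',   1, 12),
      ('Europe',  2,  9),
      ('Ireland', 3,  4),
      ('Germany', 5,  8),
      ('Berlin',  6,  7),
      ('USA',    10, 11);
    """)

    # "Me and mine": a node plus all its descendants is a single range test.
    me_and_mine = """
    SELECT c.node
    FROM   org p JOIN org c ON c.lft BETWEEN p.lft AND p.rgt
    WHERE  p.node = ?
    """
    print(con.execute(me_and_mine, ("Europe",)).fetchall())
    # Europe, Ireland, Germany, Berlin (order not guaranteed)

    # ACL style: everything under Europe EXCEPT everything under Germany.
    acl = me_and_mine + " EXCEPT " + me_and_mine
    print(con.execute(acl, ("Europe", "Germany")).fetchall())
    # Europe, Ireland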

But should such ACL-like models be used in PowerPivot cubes? In general I would say no. For two reasons:

  • The people building and utilising self-service BI solutions, like PowerPivot, tend to be high on the “need to know” hierarchy. They tend to have and need full visibility of an organisation’s data. Such security as is needed will tend to be at the data source level, i.e. what full datasets they have access to, or what filtering of such data is required, and is usually controlled by an IT-staffed gateway service of some sort.
  • As PowerPivot is, in the main, about just-in-time task-specific analysis, the simplification (via flattening or ‘hard coding’ of relationships) will usually do the trick.

But, what if you really needed to model such a thing in PowerPivot, is it possible? Yes, is the simple answer. See below for an example of a Nested Set Measure (click for larger view):

The example shows a simple “get me and mine” downwards range query, but the full power of Nested Set queries is possible. The downside of this approach is the “flatness” of the views achievable using Excel pivot tables. The addition of a “display column”, where node names are indented by a number of spaces in line with their “depth”, can help. But in the cases where I see this approach being useful, pivot table representation would not be a major problem.
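
As a small illustration of that display-column idea (again with a made-up nested-set table): a node’s depth is simply the count of ancestors whose range encloses it, and the indentation follows from that:

    # Sketch: build an indented "display column" from a nested-set table.
    # Depth = number of ancestors whose (lft, rgt) range encloses the node.
    # The org table and its data are illustrative only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE org (node TEXT, lft INTEGER, rgt INTEGER);
    INSERT INTO org VALUES ('World',1,12), ('Europe',2,9), ('Ireland',3,4),
                           ('Germany',5,8), ('Berlin',6,7), ('USA',10,11);
    """)

    rows = con.execute("""
    SELECT (SELECT COUNT(*) FROM org a
            WHERE a.lft < c.lft AND a.rgt > c.rgt) AS depth, c.node
    FROM   org c ORDER BY c.lft
    """)
    for depth, node in rows:
        print("  " * depth + node)   # e.g. 'Berlin' prints three levels in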

This type of cube would most likely be implemented as a “hub” cube or accessed via Excel’s CUBE formulas. In other words, the presentation of the data would be through other cubes or simply via “Excel reports”.

An example of where this approach might be useful would be in preparing data for generating employee performance reports. Employees might have complex KPIs that depend on their visibility of the organisation, e.g. temporary tele-sales staff might only see, and be measured against, their “own data”, while established sales staff might be measured against their team’s performance. Also, staff such as marketing support, although embedded in sales teams, might be measured against sales of a brand within a specific geographic region.

Is this easy, or is it indeed self-service BI? There would, in the scenario above, still be a lot of self-service, i.e. the Excel reports and the “spoke” cubes; even the Nested Set measures could be coded by “civilians” once they were provided with a template, and with perhaps some re-modelling of the source hierarchies by professional ETL’ers.

But as I keep saying, “ETL is hard, get over it!”. Modern operational systems can be devilishly complex, and even when they’re not, the correct reading of a system’s data model is a skill that takes some time to master; and having basic SQL skills is essential.

That’s why SQL is the heart of my microETL tool, because it is a serious ETL tool; and if SQL (and I’m talking very basic SQL here, the sort SQLite provides) is beyond you (or you have better things to be doing with your time 🙂 ), then you should assign the back-end ETL to others.

The good news is that once the ETL back-end is “sorted” the new breed of self-service BI tools really do work; just don’t expect them to magic away the inherent complexity of source systems. Maybe one day they will, but that day is not today.

Slowly Changing Dimensions – time to stop worrying.

Anybody who has worked on a major data warehouse project has war stories about that most awkward of BI concepts, the slowly changing dimension (see http://en.wikipedia.org/wiki/Slowly_changing_dimension).

The problem is not so much the need to keep a history of changes to master data attributes nor how to model these in the underlying data warehouse but how, or if, to model SCDs on end-user facing cubes.

SCDs have the potential to make end-user interaction with a cube more complex and error prone. A similar problem arises when a dimension can have several “roles”, such as a sales fact with an order date, dispatch date and invoice date.

Where the nature of the problem requires the exposure of dimensional change in a cube model, there’s no alternative. But when the justification is that there might be a need in the future, or that occasionally in the past that need existed, then the argument that front-end cubes should solve specific, not generic, problems should apply. Holding that line was not always possible when the cost of developing cubes was high, due to professional IT involvement and the well-founded fear that the money might be wasted as requirements changed over time. The result was often cubes that were more generic (and long-lived, often to the point of irrelevancy) compared to a design that was more tightly focused.

This is where I think self-service BI tools such as PowerPivot or Palo change things; if those who pay for BI need no longer insist that a design is future-proof, or that a single model be capable of handling all the current views of the business, they can concentrate on building just-in-time, highly focused (and hence usually simpler) models.

Different groups in the same business can have variations that suit them, e.g. HR/Payroll might want to see a sales rep who has changed sales area reported against the rep’s actual sales, while Sales/Marketing might be interested in the potential/history of the rep’s current sales area. Likewise with date roles: call centre management might wish to report against order date, while the warehouse guys focus on dispatch date, and Accounts only have an interest in those transactions with an invoice date.

But what about single-version-of-the-truth worries, what about reconciliation? This is where professional IT comes back into the picture. Self-service BI doesn’t mean do-it-all-by-yourself BI, any more than a self-service sausage and bacon hotel breakfast means you first catch a pig. The marshalling and delta management of data remains an IT responsibility; very small organisations and/or skilled end-users may be the exception to the rule, but ETL and data repository design are non-trivial problems and look likely to remain so. But with more and more of the front-end being delegated to (or snatched by) business end-users, the nature of back-end ETL and data modelling will change. Many of the resources once applied to a fully IT-owned BI project will be focused on speedy, cost-effective and accurate data provisioning, a less glamorous and largely invisible role compared to putting pretty dashboards on the CEO’s desktop. As a result, many IT staff with front-of-house skills who are now regarded as IT “assets” will more and more see themselves (and be seen) as front-line business ones.

So is this the end of the road for the all-rounder, the catcher of pigs, fryer of bacon and breakfast waiter rolled into one? No, I think not; there will continue to be a need for those individuals (or small teams) who can quickly and effectively tackle the once-offs, the emergencies and the need-to-knows that “normal” business practice continually brings to the surface. Small businesses (or isolated departments in larger concerns) will also continue to need tailored “end-to-end solutions”.

While I do see the continuing need for all-rounders, their background (and hence the tools and product features to support them) will change, from IT folks with business skills to end-users with IT skills. Tools such as PowerPivot are an important first step on that journey; more end-user-focused ETL tools (and services) are the next step. In the meantime, the world will have to depend on the likes of me (an IT guy with business skills, armed with a hodge-podge of ETL tools and techniques to bridge the raw-to-actionable data gap) for most of their BI all-rounders.