Alberto Ferrari describes a method of handling un-balanced hierarchies in PowerPivot. As usual Alberto comes up with an elegant solution to something that looks near impossible. For more of the same, check out his book, co-authored with Marco Russo, Microsoft PowerPivot for Excel 2010: Give Your Data Meaning. It’s the best book out there for those from a data modeling background trying to make sense of this new form of ROLAP or indeed for those from a spreadsheet background who wish to gain a better insight into PowerPivot-oriented data modelling.
PowerPivot has problems with un-balanced hierarchies (aka variable-depth parent-child hierarchies) as it’s essentially a relational engine. Relational databases have traditionally sucked at handling such hierarchies which are usually implemented using an Adjacency List Model.
When I moved from network CODASYL databases to relational databases all those years ago it was my biggest complaint about the ‘new order’ of the RDBMs. I guess it’s my network database heritage that makes working with MOLAP engines such as Palo seem so natural. Such databases are kind to hierarchies and to the users of hierarchies.
Most of the major commercial databases have now extended their SQL offering to handle hierarchy navigation, you can see the SQL Server’s way in Alberto’s post. But SQLite offers no help, so in order to make handling for what are in fact very common requirements for reporting and analysis datasets, I’ve extended the SQLite SQL available within microETL with 4 useful helper functions.
All 4 work with tables of the following structure:
Create table whatEver (HierarchyName text, Parent text,Child text,Name text,Level int,setID int,setLeft in,setRight int);
- HierarchyName allows for multiple hierarchies to be stored in the one table.
- Parent is the parent node for “this Child”. Can be text or a numeric surrogate key. Top level nodes will be set to null (or = “”).
- Child is the node identifier.
- Name is the textual identifier. If not populated, uses Child value.
- Level is the depth from top of the current path. Levels = 0 are top level nodes. Not strictly required, but so useful it’s worth populating.
- setID like Level is not required to be populated but can be useful in Nested Set queries.
- setLeft is the left-hand side of the Nested Set range associated with this node.
- setRight is the right-hand side of the Nested Set range.
What’s all this Nested Set stuff? As I pointed out above, SQL engines tend not to agree on a method of hierarchy navigation, if indeed they offer any method. Nested Sets is the classic data modelling method to get around this problem (if you look at hierarchy tables in cross-RDBMs packaged applications you’ll often find a variation on this method).
See this post Managing Hierarchical Data in MySQL for a good description of the technique.
There’s no need to populate these ‘set columns’ unless required they’re there just in case.
This function is for handling the opposite problem to that which Alberto faced. It takes an existing flattened hierarchy and turns it into a “proper” parent-child one (hence the pc element of the name). Why would you do that? Well, MOLAP tools such as Palo not only handle unbalanced hierarchies, they usually store hierarchies using an adjacency list model. Also, if you wish to use Nested Set SQL, you’ll first need the model in parent-child format.
The function’s 1st argument is the name of the hierarchy table; if it exists it must be in the format specified above, otherwise a table of that name with the required format will be created by the function.
The 2nd argument is the hierarchy’s name.
The remaining arguments are the source columns for the hierarchy in the sequence of level 0 to level n.
The function is in fact a aggregation function (like SUM()) so can be used to group the required columns from a table where they are repeated (such as a fact table on which the hierarchy has been denormalised).
Example (click on it to see larger version):
This is similar to make_pc_hier() except the source table must be in a ‘tree format’. It is also an aggregate function but the use of group by would, in this case, not make sense. The use case for this is the easy creation of hierarchy dimensions from a tree structure for use in the likes of Palo.
This takes four mandatory and an optional fifth argument. First two are the usual hierarchy table name and hierarchy name. The 3rd is the output table to receive the flattened table. The first column of this output table will be populated with the base element Child IDs (i.e. the deepest level in the nodes’ paths). The remaining columns must match the number of the next argument. This 4th argument specifies the maximum depth of the flattened hierarchy, so if the hierarchy is to be, for example, Sector,Brand,Product, then this would be 3.
The optional 5th argument if set to Y will not fill unused columns in an unbalanced hierarchy, otherwise the final nodes are filled out to the right with the last non-blank Name. This is a “stored procedure” like function, so should be called using “Select function(arg1,arg2 …);” syntax.
This takes the two usual arguments, hierarchy table and hierarchy name. Its function is to populate the “nested set” fields of the supplied table to make a valid Nested Set. Again, it’s a “stored procedure” function so should be called using the “Select function(arg1,arg2 …);” syntax.
Download microETL from http://www.gobansaor.com/microetl and locate the hier97-2003.xls file.
The microETL project is password protected; if you need access to the code just email me I’ll send you on the password.
There’s now an Alberto_hier workbook (in 2007/2010 format) with a worked example of building Alberto’s modified dimension. This workbook also includes a PowerPivot cube demonstrating the required measure (plus an equivalent Nested Set SQL Query).
For more on handling nested sets in PowerPivot see http://blog.gobansaor.com/2011/03/10/powerpivot-nested-set-measures/