
DAX Table Queries in Excel

Although the discovery of what Rob Collie calls a live specimen of Bigfoot (i.e. automated refresh of PowerPivot models from VBA) is indeed welcome, the activities that led to its discovery were intended to serve a different purpose, namely the use of DAX table queries in Excel.

DAX table queries are a new feature of DAX (part of the Denali PowerPivot upgrade) that, in my opinion, takes the DAX language away from its (very useful) pivot-table birthplace and makes it a truly powerful alternative to SQL/MDX for analytical reporting.
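For those who haven’t met one yet, the general shape of a DAX table query is an EVALUATE statement that returns a table of rows rather than a single pivot cell. A minimal sketch (the Sales table and Amount column are made-up names for illustration, not from any model in this post):

```dax
-- EVALUATE is to a DAX table query roughly what SELECT is to SQL:
-- it takes any table expression and returns its rows.
EVALUATE
    FILTER ( Sales, Sales[Amount] > 1000 )
ORDER BY Sales[Amount] DESC
```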

It was Chris Webb’s excellent series of articles on DAX that first whetted my appetite for this feature and then hinted that it could be accessed from Excel and ultimately automated if need be.

And so it can: the VBA code that I used to query DMVs and send XMLA process commands to refresh the PowerPivot model can also be used for MDX and DAX tabular queries.

And it gets better; you don’t need to use code to do this, there’s a very simple trick which will expose this functionality in plain Excel.

This trick only applies to Denali as it takes advantage of one of Denali’s most welcome new features, show detail (aka drill-thru). (An alternative method which can be used in PowerPivot V1 is detailed here  – if using V1, only DMV and MDX queries would be possible).

The first thing to do, in a PowerPivot Denali powered workbook, is to make a pivot table (of anything), such as this …

… then right-click and show-detail on any measure cell, which should result in a new sheet with an Excel table showing the drill-through, like so …

… if you then right-click, pick Tables and then Edit Query, you’ll see the MDX query associated with the table …

… you’re now free to edit this query, replacing it with whatever takes your fancy, here’s an MDX tabular query …

… you can also issue DMV queries such as “Select * from [Model].[$InvoiceHeaders]” or its equivalent DAX table query …
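Assuming the model really does contain a table called InvoiceHeaders (as the DMV above suggests), the equivalent DAX table query is just a bare table reference:

```dax
-- DAX equivalent of "Select * from [Model].[$InvoiceHeaders]":
-- evaluating a plain table reference returns every row and column.
EVALUATE InvoiceHeaders
```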

Note: both the DMV and the DAX ‘dump a dimension’ queries are unlikely to work for very large tables (such as multi-million row fact tables) unless you have loads of free memory and the ability to use it (i.e. 64 bit); without that, you’ll likely trigger a “could not allocate memory” error.

DAX is not limited to DMV-type queries but can be used to express very complex OLAP queries. Being able to define new temporary just-in-time measures (and variables) for use within a particular query will help not just to produce the result required, but will also make DAX easier to master for newcomers.
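A sketch of what such a just-in-time measure looks like (all table and column names here are illustrative):

```dax
-- DEFINE creates measures scoped to this query only; they never
-- pollute the underlying model.
DEFINE
    MEASURE Sales[Average Amount] = AVERAGE ( Sales[Amount] )
EVALUATE
    ADDCOLUMNS (
        VALUES ( Product[Category] ),
        "Avg Sale", [Average Amount]
    )
```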

The current version of HAMMER only allows DMV and MDX queries, as it restricts the “ADO” command to “SELECT” statements; the next release will remedy that.

I’m at present working on an example of using HAMMER to build a DMV/MDX/DAX Web Service such as I did with MicroETL in Python-Powered PowerPivot. This time using IronPython and .NET’s HttpListener, and simply passing the queries straight through to PowerPivot without any need for Excel pivot table involvement. So stay tuned … Update: (here it is)


DAX the new noSQL?

In my previous post DAX the new SQL I argued that perhaps DAX would replace SQL as the database query language, in particular for those I term “civilian datasmiths”. But perhaps I should have hailed DAX as a new contender in the world of noSQL.

When I started in this business in the late 1970s, the database world I encountered was a noSQL one. Database structures were in the main laid out by DBAs using either hierarchical (e.g. MUMPS) or network topologies (the various CODASYL offerings). Programmers would then navigate these predetermined access paths using a cacophony of APIs and query languages. It worked quite well, and in fact many developers found the move to a SQL world difficult, as they regarded the pure relational model as sub-optimal when developing applications (and many still do!).

But there were two main problems with such databases:

  • Although the pre-defined access paths worked well for development, they were often next to useless for downstream reporting, which usually meant that data had to be dumped to files for “reshaping” and de-normalisation, with resulting flattened data often being consumed by early end-user reporting tools such as Datatrieve. (Sound familiar?)
  • Likewise, although the original data design tended to be ideal for the original application design, downstream changes in requirements were not easy to incorporate. DBAs were busy, expensive and usually very grumpy, which meant that many enhancements, as with their reporting cousins, had to make do with using external files.

The SQL era was a huge improvement. Nevertheless, many developers found the impedance mismatch between this relational SQL world and their own “networked” object-oriented world to be a continuing problem, leading to the constant search for the perfect ORM and eventually the re-emergence of noSQL database engines.

PowerPivot’s VertiPaq engine is one of these new noSQL databases. It resembles those pre-SQL databases in that its access and relationship paths are largely determined not by its DAX query language but by whoever builds the database in the first place. The big difference, of course, is that PowerPivot’s primary (and only) use-case is reporting, while most databases have in the past targeted system data storage as their primary purpose in life, with reporting as a by-the-way. The other difference is that the creators of PowerPivot “databases” are less likely to be DBAs and more likely to be end-users, or at least those in IT, like data/business analysts, who are close to the business, i.e. more common on the ground, less expensive and hopefully less grumpy. Indeed many civilian DAX practitioners will have the luxury of being able to populate their reporting models with suitable datasets that have already been constructed by existing data warehousing systems.

It’s this separation of filtering, aggregation and transformation from topology that I think will help endear DAX to a less technical audience than has been the case with SQL. MDX also had this concept, but the underlying “more natural” tabular nature (business users live by lists!) that DAX builds on will, I think, be easier for many to engage with.

I used “I think” a lot in the previous paragraph, as it’s very difficult for somebody like myself (a professional programmer and veteran of many a database mind-share war) to know how non-technical folks will actually view something like DAX. Although my career has been one largely based on business-user interaction, I’m still a techie at heart.

So what do you civilian datasmiths think? DAX, yep I’ll learn it, or nah, that’s what those geeks down in the basement are for!

DAX the new SQL?

The latest version of the technology stack behind PowerPivot (“Denali” CTP3) is now available for download. This expands the technology significantly, with the VertiPaq engine taking on an equal role within SQL Server AS alongside the traditional OLAP engine (and it’s clear the intention is for this new “tabular” BI approach to eventually replace its venerable OLAP cousin).

Of course, tabular BI (aka reporting) is not new; it was, and still remains, the primary reporting framework for most. Pure OLAP has always been a minority sport (admittedly a well funded and very successful one). Relational datastores hold most of the world’s business data, so naturally those seeking to report on and analyse this data tend to opt for relational tools. Most such tools are SQL driven; whether hand-written or generated, the 40-year-old query language is still master of its own house.

Microsoft’s MDX language attempted to create an alternative to SQL for BI reporting. From a purely technical point of view it has succeeded; MDX can model every conceivable reporting requirement. But its ability to replace SQL was always going to be hampered by:

  • The need to build an intermediate non-relational model to operate on, i.e. OLAP hypercubes. This is not to say that OLAP models are not a useful way of modelling a business problem; for certain requirements (top-down budgeting, for example) they’re ideal, but for the other 90% of reporting requirements a relational model works as well if not better, and has the advantage of reducing the “impedance mismatch” between the operational and reporting models.
  • MDX is a very difficult language to master, at least compared to SQL. It surpasses SQL in its ability to frame a complex query, but if expecting “civilian datasmiths” to learn SQL was perhaps a realisable dream, expecting end-user MDX mastery was always a pipe-dream. MDX is, and will remain, the preserve of professional programmers (or auto-generated by the query tools built by such professionals).

Ah, but could not the same be said of relational-based reporting efforts? Star-schemas are logically different from OLTP models, and SQL also remains largely the preserve of professionals. Yes, the necessity to create a separate model adds a level of friction to the process, but the reasoning behind creating star-schemas is to simplify the complex SQL that would otherwise be required to access complex OLTP schemas (and most schemas are complex these days, as bespoke data models have been replaced with models that support highly generic packaged solutions).

Pure star-schemas (i.e. a central unified fact table surrounded by a single layer of conformed dimensions) also made it easier for RDBMSs to add suitable indexing structures and optimisers to enable fast query responses. In-memory columnar stores such as VertiPaq negate some of the performance reasoning for employing pure stars (the primary simplification reason remains), enabling a return to more “natural” snow-flaked schemas and disjoint fact tables (e.g. order-header & order-line rather than a denormalised order-line fact).

What about SQL then? Although the Q in SQL stands for Query, SQL failed to capture the hearts and minds of those who undoubtedly could have benefited from its query power, due to the lack of easy-to-declare powerful query expressions. This may seem a strange thing to say, as anybody who has fully mastered SQL and understands OLTP design principles will tell you that nearly any query is possible using pure SQL. And is SQL not easy to learn? Yes it is, and for, say, ETL tasks the level of SQL mastery required is not that great, but once you stray into the world of cross-table reporting, time windows and complex aggregations, most civilians will abandon the effort and revert to using sub-optimal spreadsheet techniques, often patching together solutions that are both costly and fragile.

This is where I think DAX will shine. The language has the potential to pick up where SQL faltered and MDX overshot: a query/reporting language that will empower both professionals and end-users. Easier to learn than SQL for the low-hanging fruit, but with the power of MDX (within easier reach) for those complex, but high-return, queries.

But will civilians invest the time to master DAX or will it become yet another SQL or MDX, largely confined to IT? I think it has a fighting chance mainly due to its close relationship with Excel. SQL always felt like a subject best learned in a university lecture hall, while DAX 101 could easily become a night-class at your local adult-education centre this autumn.

I haven’t had a chance to fully investigate the new features of PowerPivot in Excel yet, but one new feature that’s going to help PowerPivot become the default way data analysts build pivot tables (and hence drive the adoption of DAX) is the ability to show details (aka drill-thru). Another is the ability to sort columns using another column (e.g. show Month Name but sort by Month Number). Both of these may seem trivial compared to the other deep technical changes, such as the ability to natively host PowerPivot models in SQL Server (i.e. no need for SharePoint). But these two features were major omissions for those who depend on “normal” pivot tables for dashboarding and discovery purposes.

Other new features that caught my attention were:

  • New relational functions: CROSSJOIN, GENERATE (join), GENERATEALL (left outer join) and SUMMARIZE (group by).
  • Ability to generate hierarchies, and functions to navigate such hierarchies.
  • Ability to handle multiple relationships between tables (via functions that can define a new relationship at calculation time). This will allow for “roles” such as using billing date for one measure and order date for another.
  • Removal of restriction on fact table date dimension foreign keys being of type Date. Will make direct import of many data warehouse star schemas much easier (most traditional DW star schemas use integer surrogate foreign keys).
  • New “programming” helpers such as SWITCH and HASONEVALUE, which will make construction and interpretation of complex measures easier.
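To give a flavour of a few of these, some sketches follow; every table, column and measure name is made up for illustration, and the measure definitions are shown as comments since they’d live in the model rather than a query:

```dax
-- SUMMARIZE as a relational GROUP BY:
EVALUATE
    SUMMARIZE (
        Sales,
        Dates[Year],
        "Total Sales", SUM ( Sales[Amount] )
    )

-- USERELATIONSHIP activates an alternative relationship at calculation
-- time, e.g. a "role" measure over billing date rather than order date:
--   Billed Sales := CALCULATE ( SUM ( Sales[Amount] ),
--                     USERELATIONSHIP ( Sales[BillDateKey], Dates[DateKey] ) )

-- SWITCH and HASONEVALUE in a measure definition:
--   Year Label := IF ( HASONEVALUE ( Dates[Year] ),
--                      SWITCH ( VALUES ( Dates[Year] ),
--                               2010, "Prior Year",
--                               2011, "Current Year",
--                               "Other" ),
--                      "Multiple Years" )
```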

So what do you think? DAX: new dawn, or old wine in a new bottle?

Excel in Data

My speciality is data; more specifically, data at the edges: data that has been extracted from, or needs to be generated outside of, formal systems. Such data tends to be extremely valuable and useful, but only if shaped into a format that has meaning to those who can use and analyse it. I’ve been extracting, marshalling and shaping such data for most of my 30 years in the IT business; I’m a datasmith.

The technical term for what I do is ETL (Extract, Transform & Load). It’s the name given to the skill-set and tools associated with the movement of datasets. It’s the unglamorous but essential backbone of a whole range of IT deliverables: Data Warehousing, BI (Business Intelligence), Planning & Forecasting, New System Commissioning and Systems Integration.

In the past my skills would only have been of use to large IT departments, as major “data shaping” activities such as BI & data warehousing were “owned” by IT; small businesses and departmental workgroups were outside the loop except as providers of requirements (and money) and as long-suffering “end users”.

This alienation from the process of data shaping, and the need to simply “get things done”, led to the evolution of “shadow IT”, usually involving Excel in some form or another. Although many attempts have been made to control and tame this wild-west of data creation and consumption, they’ve largely failed, and with good reason: the best people to get value out of data are those who depend on and generate the data in the first place.

Microsoft has long recognised this and has over the years improved Excel’s ability to handle and pivot lists (aka tables, the primary data artefacts of business). Excel’s PivotTable is undoubtedly the world’s primary BI tool.

Microsoft has, with Excel 2010, introduced an even more powerful end-user pivoting tool, the free PowerPivot add-in. The world of self-service data analysis has changed for ever; not only can PowerPivot handle enormous amounts of data but it enables end-users and data analysts alike to build highly complex analytical models at a fraction of their previous cost.

PowerPivot offers the greatest returns to those organisations or individuals who have:

  • Access to report-friendly data sources, such as those provided by existing data warehouses.
  • The resources to purchase and maintain an Enterprise SharePoint farm (backed by Enterprise SQL Server), allowing PowerPivot dashboards to be shared (and controlled) via a web interface. Such infrastructure, although ideal, comes with an expensive price tag.

So, if your organisation or your team is missing one or both of the above, is PowerPivot not for you?

Far from it: PowerPivot is still extremely useful without either. Data that’s not in the ideal format can still provide useful analyses, and both Excel and PowerPivot can help cleanse such data. Likewise, just like an ordinary workbook, a PowerPivot workbook can be emailed or placed on a shared drive (all the supporting data is embedded in the .xlsx file). The only problem is that the time and resources associated with this activity can be substantial if you don’t have the ‘micro ETL’ and general Excel VBA skills to help automate it. This is where I can help.

I’ve both the skills and the tools to build a rock-solid Excel-based ETL back-end for your PowerPivot projects. This can be as simple as a “good enough” data shaping automation for a single PowerPivot cube or a sophisticated ‘micro-Data Warehouse’ which would act as a foundation for a more long-term self-service BI strategy. The only software product you’ll need to purchase is Excel; I’ll provide the rest. Most of the time Excel 2010’s & PowerPivot’s in-built ETL capabilities, automated with some VBA, will be all that is required, but when more ‘firepower’ is needed:

  • I’ve assembled my own Excel-based ETL tool specifically to enable me to quickly and cheaply build such micro-ETL solutions; it’s a collection of best-of-breed open-source tools (such as SQLite) driven by, and integrated into, Excel (versions 2000 – 2010, 32-bit & 64-bit).
  • I can call on Palo OLAP (an open source, Excel focused budgeting & forecasting tool)
  • and on Talend, a leading open source ETL and data-management suite.

I can also automate the work-flows and non-ETL repetitive tasks associated with such projects. Tasks such as: generating questionnaires, marshalling and recording the resulting replies, and generating non-PowerPivot workbook reports/pivots (remember, unlike previous versions, Excel 2010 can exist side-by-side with older versions of Office, so you need only provide it to those who need its full power).

As for building PowerPivot cubes, I’ve a long background in the design & delivery of ROLAP/MOLAP solutions. PowerPivot is essentially a third way of producing OLAP cubes, heavily influenced by the dimensional modelling technique commonly known as the star-schema, but with the presentational simplicity of Excel’s PivotTable and an Excel-like formula language (DAX) bolted on. PowerPivot models built with an understanding of the usefulness of the star-schema approach will allow for richer and simpler data analytics. I can provide that knowledge or help you master it.

My involvement can be either on a “deliverable” basis, i.e. I deliver the required solution, usually fixed-price for a fixed deliverable or on a consultancy basis, where the “deliverable” is more an ongoing process and would often involve a degree of skills transfer.

I excel in data, with tools such as PowerPivot so can you.

Star Schemas – to boldly go where no Excel spreadsheet has gone before

One of the many things that delights me about PowerPivot is the central role played by the Star Schema. Those of you reading with a data-warehousing background would shrug your shoulders and say: “So what, what else would you expect to find at the core of a BI tool?”.

Those from an Excel PivotTable background would ask: “What’s a Star Schema, why do we need one, and what’s wrong with the good old-fashioned single flattened table?”.

Those from a classic MOLAP background (Essbase, TM1, Palo) might also ask: “Why do we need this extra layer? Load the cube directly from the operational data model and get on with it!”.

A quick Q&A is perhaps the best way for me to explain why star schema design is a powerful skill in a datasmith’s toolset.

First off, what’s a Star Schema?

A Star Schema (also known as the dimensional model) is a denormalised (flattened) data model used to simplify an operational (OLTP) data model to better accommodate reporting and what-if analysis.

At its simplest, it consists of a central fact table with links back to a “surrounding” set of dimensional tables, hence the star name. A variation is the snow-flake schema, where the dimensional tables are not fully denormalised (e.g. Product Category->Product->Fact instead of  Product->Fact).

The role of the fact table (besides being the table that hosts most of the measure fields) is to create linkages between dimensions (such as Customer, Product, Date) usually based on an actual transactional event (e.g. Invoice) or a proposed event (such as a Budget or Forecasted Sale). In effect, simplifying the often complex work-flow-driving connections of a typical operational system by using a single many-to-many relationship (modern ERP/CRM systems’ data models consist of scores of configurable many-to-many relationships).

Many wrongly believe the star-schema was adopted for performance reasons, and that now that in-memory OLAP is becoming the norm it’s no longer necessary to use dimensional modelling techniques. In fact, in the early days of data-warehousing, RDBMSs had great difficulty efficiently handling star-queries (and some, such as MySQL and SQLite, still do).

The original primary purpose of the star schema was to simplify the SQL required to access reporting data: to make the model more approachable to non-technical users. Of course, even simple SQL was beyond the knowledge or interest of most end-users, but a sizeable proportion were happy to write it (often helped by SQL “generators” such as MicroStrategy or Business Objects). But even in situations where SQL-wielding civilians were not to be found, the simplicity of the dimensional models proved to be a valuable aid when establishing and developing the warehouse data requirements. PowerPivot requires no SQL knowledge to manipulate the dimensional model, which brings the original concept full-circle, but this time opening its possibilities to a much wider audience.

But surely, concentrating on the actual reports would be a more valuable requirements gathering exercise?

A so-called “bottom-up” approach is often the best way to approach a reporting request, particularly if the reports are simple one-off “traditional” reports. But for self-service BI, this needs to be combined with a top-down dimensional design. The idea is not to build out each and every report, or indeed cube, but to build a structure that’ll support likely queries. The process of building a star schema provides both a logical model and a physical implementation of that model against which potential queries can be tested. I’ve worked on several POCs destined for implementation in Essbase where the star-schema was built and potential cubes were mocked up using Excel PivotTables, yet which subsequently never went any further (except for the star-schema ETL process): the end-users derived sufficient value from the denormalised star-schema pivoted and reported in Excel.

In traditional ROLAP data-warehouses where the cubes were built directly against star-schemas, the pure logical approach to the data model often had to take a back-seat to the necessity of fine-tuning it to make response times (be that ETL or user-pivoting) acceptable. This is why I much preferred situations where the star acted as a logical model from which MOLAP cubes were built.

With PowerPivot, ROLAP has a new champion. The column-oriented, high-compression, in-memory architecture means that the compromises of the past are no longer necessary. The fact table reverts to its primary role as a many-to-many connector. In a pure hypercube, measures are just another dimension (the approach that Palo takes); this is now also true of PowerPivot models: measures can be sourced from dimension tables and dimensions from fact tables, as logically they should be, but without the performance hit of old.

But what’s the advantage of a star schema over a flattened table when using PowerPivot?

It is true that the same flattened-table model as used to back a PivotTable can be used within PowerPivot. But doing so would limit the potential of the DAX language to construct measures such as average sales spread over potential customers (rather than the actual customers that would typically be represented on a flattened table). Also, by creating “conformed dimensions” (single cross-business views of Customer, Product etc.) and using such tables as dimensional sources for multiple fact tables, “virtual cubes” that combine values from multiple fact tables can be built.
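As a concrete illustration of that first point, with Customer as its own dimension table the denominator can be every customer, not just those appearing in the fact table (table and column names are again made up for illustration):

```dax
-- Average sales spread over ALL (potential) customers.
-- With a single flattened table, only customers who actually have
-- sales rows exist, so this denominator could not be expressed.
Sales per Potential Customer :=
    SUM ( Sales[Amount] ) / COUNTROWS ( ALL ( Customer ) )
```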

If you’re new to dimensional modelling I’d recommend the books & articles of Ralph Kimball as a good starting point. You do have to be aware that some of the advice regarding efficiency trade-offs, surrogate keys etc. does not apply in a PowerPivot scenario (even though other performance issues still apply), but the logical design tips still hold.

Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.