Category Archives: data

Looks like Star Schema is back on the menu!

As the release date for Excel 2013 gets nearer (in fact, Home Premium has already arrived), the pricing and package offerings are becoming clear(er). For many of us the emerging picture is not pretty, mainly due to a U-turn on Microsoft's part, i.e. removing the PowerPivot add-in as an option for most low-end Office 365 packages and ALL retail versions.

Rob Collie goes into the likely back story (and politics) behind such an obviously (well, to those of us at the Excel coalface) stupid move. MS not only led some of its most ardent evangelists up the proverbial "garden path" with PowerPivot under Excel 2010 (never before has a major feature been so restricted in a subsequent version) but also gave a false and misleading impression of the new Office 365 packages during the current trial period, in which all versions have the PowerPivot add-in (and Power View!) as an option. "Being lied to" is the expression that keeps coming to mind.

There is, however, a silver lining: Excel 2013's Data Model. If I had never seen the power of DAX Measures (which are, in essence, what the lack of the add-in deprives you of) I would be completely bowled over by the data handling power of the Data Model; it is, in effect, the VertiPaq columnar engine in all its power and glory, minus the ability to create (though it can still consume) explicit DAX measures and columns. But I have, so hence my disappointment 😦

But even without the add-in, the Data Model opens up a whole world of data munging potential.

At its simplest, it makes "VLookup hell" a thing of the past. This alone, as anyone who has laboured under the tyranny of VLookup (and its cousins Index & Match) will agree, is a huge reason to upgrade to Excel 2013.

Also, all versions of Excel (OK, not all: RT and the basic Web Apps do not support the Data Model) can consume a DAX-rich model produced by somebody else who has access to the add-in. Now, before you get too excited, models produced by the free and widely available 2010 version must still be upgraded by the 2013 version before they can be used by 2013's Data Model. UPDATE: the previous statement is not quite true; it is possible to transfer the binary data file holding the model from 2010 to 2013 (but it must be a V2-generated model; V1 models, or V2s upgraded from V1, will not work). I have done a POC and will likely add this facility as a public command to HAMMER in the near future.

It is also possible to create DAX Measures and even Calculated Columns against all Data Model-enabled versions using an AMO-powered .NET add-in. Last week, being a hacker at heart, I created a POC using Python within my HAMMER micro-ETL toolset to do just that. No PowerPivot add-in required to populate even the humblest Excel version with sophisticated DAX models!
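For the curious, the gist of that AMO approach looks something like the sketch below. This is not the HAMMER code itself, just a minimal illustration assuming an IronPython (or pythonnet) session with the Microsoft.AnalysisServices (AMO) assembly available, and assuming you have already obtained a connection string to the workbook's embedded Data Model engine; the connection string and the 'Sales' table/measure names are hypothetical placeholders. Tabular models of this vintage keep their measures in the cube's MDX script, so adding a measure amounts to appending a CREATE MEASURE statement to that script.

```python
# Sketch only: inject a DAX measure into an embedded Data Model via AMO.
# Assumes IronPython/pythonnet with the Microsoft.AnalysisServices assembly loadable;
# the connection string and table/measure names are hypothetical placeholders.
import clr
clr.AddReference("Microsoft.AnalysisServices")
from Microsoft.AnalysisServices import Server

conn_str = "Provider=MSOLAP;Data Source=<embedded-model-connection>"  # placeholder

srv = Server()
srv.Connect(conn_str)
try:
    db = srv.Databases[0]          # the single database hosting the workbook's model
    cube = db.Cubes[0]
    script = cube.MdxScripts[0]    # measures live in the model's MDX script

    # Append a DAX measure definition and push the change back to the engine
    dax = "CREATE MEASURE 'Sales'[Total Sales] = SUM(Sales[Amount]);"
    script.Commands[0].Text += "\n" + dax
    script.Update()
finally:
    srv.Disconnect()
```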

And there's more! You can also use DAX queries to generate tabular output from a plain Data Model; see Chris Webb's example. Again, a PowerPivot-add-in-free zone.

But why the title? Why are star schemas back on the menu, and were they ever off the menu?

In this post (and the related posts linked to within) I've argued that the strict "laws and practices" of dimensional modelling (as popularised by Kimball et al.) can be relaxed (and often ignored). The world that spawned such 'best practices' was not a world of DAX-powered, or even MDX-powered, data models; it was one where the RDBMS's version of a hyper-cube – the star schema – ruled.

The only tool was SQL (with its limited aggregate functions) and the single-issue star schema was the only viable data structure. But sophisticated summary-table techniques (requiring an army of DBAs) and new RDBMS optimizers were needed to get a reasonable response time. Plus, much of the work (and most of the cost) was in the ETL phase. That's the world I cut my BI teeth in. How we would have wished for a modern columnar engine! And now, 20 or so years later, such a thing exists, in a spreadsheet! So dust off your Kimball dimensional modelling books, the plain-old ETL'd star schema is back in town!

But what about the ETL required to build such pure stars? Again, add-in-free DAX Query to the rescue! Like this, to create a Time Dimension. Dimensional data could easily be enhanced (e.g. adding ABC type attributes) using similar techniques.
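If you'd rather not (or can't) use a DAX query for the date dimension, plain Python will do the same job perfectly well. The sketch below is an alternative to the DAX-query approach linked above, not a copy of it; the column names are purely illustrative. It writes a simple date dimension to CSV, ready to be pulled into any Data Model-enabled Excel.

```python
# A plain-Python alternative to the DAX-query approach: generate a simple date
# dimension as CSV for import into the Data Model. Columns are illustrative only.
import csv
from datetime import date, timedelta

start, end = date(2010, 1, 1), date(2015, 12, 31)
with open("dim_date.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["DateKey", "Date", "Year", "Quarter", "Month", "MonthName", "DayName"])
    d = start
    while d <= end:
        w.writerow([d.strftime("%Y%m%d"), d.isoformat(), d.year,
                    (d.month - 1) // 3 + 1, d.month, d.strftime("%B"), d.strftime("%A")])
        d += timedelta(days=1)
```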

Fact data could be more of a problem (due to the 1M-row limit of an Excel table – the output medium of Excel-based DAX queries). But again, if you have access to even the most basic of relational databases (like the inbuilt SQLite engine in HAMMER), even the shaping of very large fact tables can be handled. For more demanding requirements, maybe use AWS's Data Pipeline "ETL-in-the-cloud"?
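As a flavour of what that shaping might look like (HAMMER embeds its own SQLite, but Python's standard-library sqlite3 module makes the same point), here's a rough sketch that pre-aggregates a raw transaction table into a star-ready fact table small enough to stay within Excel's 1M-row limit. Table and column names are invented for illustration.

```python
# Sketch: shape a large raw transaction table into a summarised, star-ready fact table
# using SQLite (table and column names are made up for illustration).
import sqlite3

con = sqlite3.connect("staging.db")
con.execute("DROP TABLE IF EXISTS fact_sales")
con.execute("""
    CREATE TABLE fact_sales AS
    SELECT date_key, customer_key, product_key,
           SUM(quantity)   AS quantity,
           SUM(net_amount) AS net_amount
    FROM   raw_sales_lines
    GROUP  BY date_key, customer_key, product_key
""")
con.commit()
con.close()
```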

Anybody with experience of using pure stars will of course be aware of the downside, i.e. the restrictive nature of the end result. Such data models tend to deal with a limited number of "questions", especially compared to DAX-destined data models (which can be very close to their OLTP sources in structure).

But that can also be an upside, depending on the likely end-users of the “cube”. The simple single-issue pure star-schema is often better suited to an environment where most are “consumers” rather than “explorers” of data.

In any case, it would be nice to have the option to service both end-user types. So please, Microsoft, at the very minimum, revert back to the 2010 scenario, i.e. PowerPivot as a downloadable add-in. If you do, the phrase "Oh, it's just Excel" will be one for the history books.

UPDATE: Aug 2013

MS have seen sense! The Excel 2013 Standalone Retail version now comes with both the PowerPivot and Power Query (aka Data Explorer) add-ins. http://support.microsoft.com/kb/2817425

 


The Civilian Datasmith – a live specimen.

[Image caption: A live specimen – not a datasmith but a related data-consuming species: the PacMan]

If you've been following my rambling posts for any length of time you'll have come across the term "civilian datasmith" quite a lot. For example, in my last two posts on DAX I've been looking at whether civilian datasmiths are likely to use PowerPivot's DAX.

Professional datasmiths, like myself, will have little choice but to learn and adopt DAX if we intend to continue to service either the high-end Excel management information market or the general MS BI tools market, but will our civilian peers follow suit? I think many will, particularly those who are already comfortable with high-end “excel formula programming” or those who already have extensive SQL and/or data modelling experience. But who are these non-IT datasmiths?

Well, they're very unlikely to have "data" anywhere in their job titles and often share those job titles with dataphobic colleagues. But you'll know them: they're usually the go-to guys/gals for all sorts of data analysis/integration/discovery tasks and they tend to favour making decisions based on hard facts rather than gut instinct. They've learned to make use of whatever data and whatever tools are at hand; while others complain about tools/systems not doing what they want or IT not reading their minds, datasmiths just get on with making things happen.

An example of a real live civilian datasmith in action is Richard Baxter of SEOGadget. I know Richard from Excel development work he’s commissioned from me, including this Google Adwords Excel Add-in that appears to have been all the rage at this year’s MozCon conference.

I was at first somewhat surprised that an SEO professional was commissioning Excel add-ins, as the typical SEO tools tend to be web-based (usually by default, I think, rather than by rational choice), but I soon realised I'd found yet another datasmith-enabled profession (to add to the more typical accountants, engineers, sales & marketing managers etc.).

Richard's a classic datasmith. To see him in action, check out the slides of his well-received MozCon presentation. Note how he pulls data from multiple sources but uses Excel to merge and shape the data (one other tool I'd suggest he check out would be my HAMMER, ideal when multiple tables need to be queried, merged, cross-joined etc.).

So if you're an SEO pro (or indeed any data-dependent professional) struggling with too much data, or worrying that decisions are being made based on too little data, check out the following:

  1. Microsoft Excel for SEO http://www.distilled.net/excel-for-seo/ – not just for SEO types; an excellent rundown of the Excel skills any would-be datasmith must have.
  2. See Richard’s Adwords Add-in for Excel, a good example of the sort of thing Excel can do through the power of custom-coded add-ins. (UPDATE: Here’s a video of the tool in action)
  3. Then check out HAMMER, my swiss-army-knife-like datasmithing tool for those whose working lives revolve around data tables (primarily an Excel add-in for use as an array formula, but it can also be called from VBA and via the command line (aka the DOS box)).



DAX the new noSQL?

In my previous post DAX the new SQL I argued that perhaps DAX would replace SQL as the database query language, in particular for those I term "civilian datasmiths". But perhaps I should have hailed DAX as a new contender in the world of noSQL.

When I started in this business in the late 1970s, the database world I encountered was a noSQL one. Database structures were in the main laid out by DBAs using either hierarchical (e.g. MUMPS) or network topologies (the various CODASYL offerings). Programmers would then navigate these predetermined access paths using a cacophony of APIs and query languages. It worked quite well, and in fact many developers found the move to a SQL world difficult, as they regarded the pure relational model to be sub-optimal when developing applications (and many still do!).

But there were two main problems with such databases:

  • Although the pre-defined access paths worked well for development, they were often next to useless for downstream reporting, which usually meant that data had to be dumped to files for “reshaping” and de-normalisation, with resulting flattened data often being consumed by early end-user reporting tools such as Datatrieve. (Sound familiar?)
  • Likewise, although the original data design tended to be ideal for the original application design, downstream changes in requirements were not easy to incorporate. DBAs were busy, expensive and usually very grumpy, which meant that many enhancements, as with their reporting cousins, had to make do with using external files.

The SQL era was a huge improvement. Nevertheless, many developers found the impedance mismatch between this relational SQL world and their own  “networked”  object-oriented world to be a continuing problem, leading to the constant search for the perfect ORM and eventually the re-emergence of noSQL database engines.

PowerPivot's VertiPaq engine is one of these new noSQL databases. It resembles those pre-SQL databases in that its access and relationship paths are largely determined not by its DAX query language but by whoever builds the database in the first place. The big difference is, of course, that PowerPivot's primary (and only) use-case is reporting, while most databases have in the past targeted system data storage as their primary purpose in life, with reporting as a by-the-way. The other difference is that the creators of PowerPivot "databases" are less likely to be DBAs and more likely to be end-users, or at least those in IT, like data/business analysts, who are close to the business, i.e. more common on the ground, less expensive and hopefully less grumpy. Indeed, many civilian DAX practitioners will have the luxury of being able to populate their reporting models with suitable datasets that have already been constructed by existing data warehousing systems.

It's this separation of filtering-aggregation-transformation from topology that I think will help endear DAX to a less technical audience than has been the case with SQL. MDX also had this concept, but the underlying "more natural" tabular nature (business users live by lists!) that DAX builds on will, I think, be easier for many to engage with.

I used "I think" a lot in the previous paragraph as it's very difficult for somebody like myself (a professional programmer and veteran of many a database mind-share war) to know how non-technical folks will actually view something like DAX. Although my career has been one largely based on business-user interaction, I'm still a techie at heart.

So what do you civilian datasmiths think? DAX, yep I’ll learn it, or nah, that’s what those geeks down in the basement are for!

HAMMER a new Excel ETL tool for the PowerPivot age …

So, why am I developing this new datasmithing tool when I already have microETL to do the same task?

To be honest, when I started, my aim was to simply port  a lot of my useful framework code to .NET as I’m doing more and more work in pure .NET these days.

It’s a lot easier to do so with a project driving the work rather than the drudgery of line by line conversions simply for the sake of conversion.

But over the last few months – investigating how to move some of my more useful code over to .NET, discovering C#-SQLite as a near drop-in replacement for SQLite, realising that IronPython 2.7 marks the coming-of-age of Python under .NET, and discovering the ease of multi-threaded programming under .NET 4.0 – I began to see the potential of building a tool more suited to this new emerging world of technological plenty, of 64-bit, multi-core, high-RAM machines playing host to ground-breaking tools such as PowerPivot, than to the 32-bit, single-threaded world that microETL was born into.

I wanted to bring microETL back to its roots as an in-cell function; to make it sit more easily with the spreadsheet's functional style of programming, a style that has been more successful at attracting "civilians" to programming than any other method; to make the tool more in tune with the Excel way of doing things.

At the same time, I also wanted the tool to be “detachable” from Excel, so that it could perform ETL functions server-side without the need for Excel. Ideally capable of being run from the command-line from a single no-install-required executable.

And so, the Datasmith’s HAMMER was born.

So where does this fit in with PowerPivot? When I design something like HAMMER I normally craft it around a series of "use cases". Below is one of my PowerPivot & HAMMER use cases:

The Data Quality Team

The sales & marketing department of a wholesaling company have discovered PowerPivot and are excited about using it to help manage an upcoming merger with a recently purchased competitor, also in the wholesaling business. There’s a large overlap between customers and products supplied, and the sales structures, pricing and KPIs used by both companies are very different. The transition phase from two separate companies to a single operating unit will need detailed  planning and constant monitoring.  Oh, and there’s also the problem of the brand new ERP system that’ll replace the existing systems currently in use.

The group can see how PowerPivot models will help with this task but are worried about sourcing and managing the data. They decide to appoint a small data quality team, with a mixture of IT and business experience, which will be responsible for mapping old to old and old to new, and for managing the data deltas (business still goes on: new customers, new products etc.).

Most of this work revolves around "master data". Transactional data may be much larger in volume, but get the master data right and the transactional data will be easy to handle, as long as the processing capacity to handle the volumes is available (which, thanks to PowerPivot and really fast high-RAM PCs, it is).

In golf, there's a saying: "You drive for show, but you putt for dough." Likewise in data: a lot of emphasis is put on the big-data transactional datastores, but the real analytical benefit comes from the less sexy master data, aka the dimensions. As a result, the most important output from the data quality team will be a set of conformed dimensions.

Each week (and at month-end) the team will get extracts of the master and transactional data in various formats from both legacy systems, and will also get the WIP master datasets from the team working on the new ERP system. From these they'll construct point-in-time conformed dimensions combining old with old and old with new, and will rebuild transactional feeds with new conformed keys alongside existing keys. These datasets are then sent to the various analysis teams to enable them to build Excel and PowerPivot models that will hopefully all "sing from the same hymn sheet".

And how will this be accomplished? With the datasmith's HAMMER of course (well, it is my day-dream use-case!). Both in-Excel and command-line HAMMER "scripts" will be used to wrangle the datasets, with the complete sets packaged as SQLite database files.

When the analysis groups receive these files, they too will use HAMMER to extract the datasets they require (either straight to CSV for loading into PowerPivot, or into Excel for some further transformation work prior to modelling). To handle slowly-changing-dimension scenarios, many of the teams will squirrel away each week's data in order to be able to model yet-unknown-but-very-likely changes to organisational structures.
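The "straight to CSV for loading into PowerPivot" step above is about as simple as data plumbing gets; a sketch using Python's built-in sqlite3 module is shown below. The file and table names are invented, and in the use case this would of course be a HAMMER command rather than hand-written Python.

```python
# Sketch: pull one conformed dimension out of a shipped SQLite file and dump it to CSV
# for loading into PowerPivot. File and table names are invented for illustration.
import csv
import sqlite3

con = sqlite3.connect("week_35_conformed.db")
cur = con.execute("SELECT * FROM dim_customer_conformed")
with open("dim_customer_conformed.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow([col[0] for col in cur.description])  # header row from cursor metadata
    w.writerows(cur)
con.close()
```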

Although this is a use-case, it’s based on reality, a reality that would have been so much easier to manage if I, and the teams I worked with, had access to technology such as HAMMER and PowerPivot.

To download the latest version of the code, go to this page on my website.

Follow the HAMMER tag on  this blog for information on commands and examples (best start with the oldest and work forward …)

S3 as an Excel hub

In my Steam Powered PowerPivot post I explained the concept of powered-spoke-based Excel systems. One of the hub options was Amazon's Simple Storage Service, S3. I've been using S3 both as a simple backup service and as a "systems hub" since the service began in 2006. Back then, it was not necessarily the easiest to use or the most suitable, but it was cheap and reliable. The code I used to access S3 from Excel can be found here http://blog.gobansaor.com/2008/02/22/xlaws-excel-vba-code-for-accessing-amazons-s3-and-simpledb/

Over the years, lots of new features have been added to S3 that have made the service more useful and easier to use. The final big missing piece, for me at least, was the lack of multiple authorisation credentials for a single account, i.e. the ability to set up multiple users with their own access policies. Last year, with the launch of the IAM (Identity & Access Management) service, that last major problem was removed.

Although IAM is currently accessible only via the API, that's not a problem, as my favourite tool for managing S3 (and AWS in general) is Python's boto package, and it has that, and everything else you might wish to do S3-wise, covered.
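For anyone who hasn't met boto, a couple of lines is all it takes to push an extract up to a bucket and pull it back down again; the bucket name, key and credentials below are placeholders.

```python
# Sketch: basic S3 upload/download with the classic boto package.
# Bucket name, key and credentials are placeholders.
from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")  # ideally an IAM user's keys
bucket = conn.get_bucket("my-excel-hub")

k = Key(bucket)
k.key = "extracts/sales_week22.csv"
k.set_contents_from_filename("sales_week22.csv")      # upload
k.get_contents_to_filename("sales_week22_copy.csv")   # and download again
```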

Although boto does most of my heavy lifting (particularly on the server side), I still have need for a VBA-based S3 utility. Boto is fine for situations where I'm using microETL with Python enabled, but in many situations this is not ideal or even possible, particularly on the client side of systems. Here, even a single-file add-in may not be possible, which often means a macro-enabled workbook is the only option.

MicroETL has a noSQL compile option which enables a lot of the code to operate without the SQLite libraries, thus allowing it to be used as a single-file VBA add-in or embedded in a workbook.

Up until now my S3 code has depended on the xliteSha1Hmac.dll to provide SHA1-HMAC hashing (necessary to authorise access to S3), meaning that only pre-signed S3 URLs could be used in single-file deliverables. Even if this were not the case, prior to IAM, pre-signed URLs (usually with time limits) were often the only option, as sharing an S3 account's single set of authorisation credentials with all and sundry was not ideal from a security point of view. But post-IAM, being able to set up sub-accounts with specific access policies means the full power of S3 can now be used at client level.
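For the curious, that SHA1-HMAC signing is no great mystery. The sketch below shows the classic S3 query-string ("pre-signed URL") authentication in Python; boto's generate_url() wraps the same logic in a single call, and the keys here are placeholders.

```python
# Sketch: build a time-limited pre-signed S3 URL by hand (signature v2 style).
# boto's generate_url() does the same in one call; keys below are placeholders.
import base64
import hmac
import time
from hashlib import sha1
from urllib.parse import quote_plus

def presign(bucket, key, access_key, secret_key, expires_in=3600):
    expires = int(time.time()) + expires_in
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), sha1).digest()
    signature = quote_plus(base64.b64encode(digest).decode())
    return ("https://%s.s3.amazonaws.com/%s?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, key, access_key, expires, signature))
```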

I'm in the process of adding a pure-VBA HMAC-SHA1 facility (32-bit only at the moment; 64-bit continues to require the xliteSha1Hmac.dll). I'm also adding a JSON encode/decode module to allow tables to be more easily shared with non-Excel programs. Currently I default to sharing tables encoded using ADO's XML export schema via S3, but the JSON option will make it easier to share with a pure Python server (such as a PiCloud-based service; see my Expand Excel's Horizons – look to the cloud post).
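The JSON "table" idea is nothing more exotic than a list of row objects. Something like the sketch below (the field names are illustrative and this is not microETL's actual encoding) is trivially easy for a pure Python server to consume.

```python
# Sketch: a table as a JSON list of row objects, easy for non-Excel programs to consume.
# Field names are illustrative; this is not microETL's actual encoding.
import json

rows = [
    {"CustomerKey": 101, "Name": "Acme Ltd", "Region": "West"},
    {"CustomerKey": 102, "Name": "Globex",   "Region": "East"},
]
payload = json.dumps(rows, indent=2)   # what a workbook might post to S3
round_trip = json.loads(payload)       # what the Python server would read back
```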

Cloud-shared tables can be used not only to pass datasets back and forth but also to pass SQLScripts (which are, after all, simply tables) for processing elsewhere.

For example, client workbooks without Python (or SQLite) functionality installed could write a script and post it to S3; that script table could then be picked up by a "server" microETL workbook (or by a pure Python server), executed, and the results posted back to S3 to be picked up by the client (possibly waiting for the response using a timer).

Server processes could also do the opposite: passing down SQLScripts for client-side execution.
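To make the round trip concrete, the sketch below shows what the "pure Python server" end might look like: poll a bucket for posted scripts, run each one against a local SQLite database, and post the results back. The key prefixes, file layout and the assumption that each script is a single SELECT are all inventions for illustration; the real microETL/SQLScript format is richer than this.

```python
# Sketch: a minimal S3-polling "server": pick up posted scripts, run them against SQLite,
# post the results back as CSV. Layout and naming conventions are invented; assumes each
# posted script is a single SELECT statement.
import csv
import io
import sqlite3
import time
from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
bucket = conn.get_bucket("my-excel-hub")

while True:
    for req in bucket.list(prefix="inbox/"):
        sql = req.get_contents_as_string().decode("utf-8")
        con = sqlite3.connect("server_side.db")
        cur = con.execute(sql)
        out = io.StringIO()
        w = csv.writer(out)
        w.writerow([col[0] for col in cur.description])
        w.writerows(cur)
        con.close()

        result = Key(bucket)
        result.key = req.name.replace("inbox/", "outbox/", 1)
        result.set_contents_from_string(out.getvalue())
        req.delete()                 # mark the request as handled
    time.sleep(30)                   # poll every half minute
```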

When people think of utilising the cloud, this sort of scenario is not what comes to mind; the assumption is that Excel is a desktop product and so cannot take advantage of the enormous cost advantages of cloud-based solutions. Not so: it's now very easy to integrate the likes of S3 into existing processes, adding that cloud magic while keeping those existing works-so-don't-fix-it processes firmly on the ground.

If you need help figuring out how to take advantage of S3 (or other AWS services) and how to integrate them with your existing technology base, do contact me. I've been working with this technology for over 5 years (and have 30-odd years of experience with good old reliable "ground-based systems").

What’s with the bucket picture? S3 calls its primary data storage unit a bucket.

UPDATE:

IAM is now available from the AWS Management Console http://aws.typepad.com/aws/2011/05/identity-and-access-management-console-support.html