It’s no secret that I’m a huge fan of SQLite and Excel, particularly when used in combination. I also greatly admire the open source BI engines, Palo and Mondrian. Mondrian appeals because of its “ROLAP with a cache” architecture and its implementation of MS’s excellent MDX language. When I say MDX is excellent I’m talking with my professional programmer’s hat on, as an end-user tool it’s a non-runner. This is where Palo comes in, building on the hypercube concepts pioneered by the likes of TM1 and ESSbase, it presents a designview that’s approachable by a vastly greater percentage of “civilians” than is the case with ROLAP-based solutions.
The trick behind TM1, Essbase, Palo etc. is the extension of the spreadsheet metaphor from two to multiple dimensions, while still binding the interface closely to the familiar spreadsheet (which for most of the business world is still Excel).
So where does SQLite come in all this?
At first glance, SQLite lacks the sophisticated join functionality to support star-queries, but of course, if the dataset is small then a full-table scan of a fact table, or better still, loading the fact table into memory negates any such short-comings.
In fact, all traditional ROLAP engines have problems with dimensional models, particularly when you reach the point of using summary tables or query re-writes, that’s why the emerging SQL-speaking columnar-databases are such a godsend for ROLAP data warehouses.
It was SQLite combined with Excel acting as a data prep platform that was originally my main interest, so for pivoting, Excel’s own pivot table would have to do. Nevertheless, I felt the tool was incomplete without the ability to directly pivot the underlying SQLite database.
Why not use Palo or Mondrian as a pivot tool? Well yes, where a fixed permanent “solution” is required then the extra moving parts of either approach would be justified and indeed necessary but that is to miss the essence of what I call datasmithing.
Datasmithing is not data warehousing nor is it the provision of solutions (which, for example, Palo superbly enables in multi-user budgeting situations). Datasmithing, as a skill, is of course part of the process of both, but it’s on the edges, at perhaps the planning or consumption stages.
Datasmiths deal in the unknown, in change, in disaster recovery, in systems’ commissioning, in the never-ending barely-repeatable processes thrown up by daily business life. For that, the toolset required must be as simple as possible (but no simpler), self-contained, document-oriented, secureable (is that a word?) and easily archived and retrieved. Excel and file-based DBMSs such as MSAccess or SQLite fit the bill nicely, server-based technologies such as DBA controlled database servers or IT installed “solutions”, less so.
Jedox has made Palo relatively easy to install (and likewise, Canada’s SQLPower has made Mondrian setup a painless exercise via their excellent Wabit reporting tool), but, the zero-install, email friendly document approach that spreadsheets are famous (and infamous) for, is preferable in many situations. This is something that Microsoft have recognised in their Gemini add-in for Excel 2010, but Excel 2010 is a not here yet and it’s likely to be five years or more before it’s as common as Excel 2003 is today.
The inclusion of FTS full-text searching with SQLite triggered an ah-ah moment with regards to pivot-enabling SQLite.
The usual method that hypercube-like excel-friendly OLAP tools use to return data is via a UDF like so…
…where valueN represents dimensional elements, so…
…is the Actual Profit for Beer sales in Jan 09. The dimensional elements act as “tags” to locate a particular value, there is of course much more to tools like Palo; hierarchies, intra-cube rules etc. but in essence most OLAP tools are like www.delicious.com for number crunchers. This method of retrieving data fits well with how people use Excel and not just for pivots, but for embedding OLAP aggregated cells in lists. For example, a CRM scenario; a Sales Rep makes a list of her ‘best’ (subjective) customers, but needs hard (objective) stats, to be placed alongside the list to convince the boss or to track actuals against expectation.
Dimensional elements as tags; FTS3 virtual tables as fact table indexes; the concept of a TAG Cube was born.
In the above example “Profit” would most likely be described as a measure (Palo, a near pure hypercube does not distinguish between Measure and other Dimensional coordinates). Dimensions, measures and attributes are in reality interchangeable (a Customer ID can act as a dimension or an attribute, but by applying a Count Distinct to it, it’s a measure) but most OLAP solutions treat “Measure Dimensions” as different, and so do TAG Cubes.
By using the default fact table structure (a single-columned table) and querying using the default measure (which translates to the SUM() of that single value) a ‘pure’ approach can be used. But, ROLAP is tightly bound to the concept of a fact table, and since SQLite is relational, TAG Cubes offer the ability to use a wide fact table approach and I think gains considerably in flexibility by so going.
The above example of using Count Distinct, or the simple creation of calculated measures are examples of this flexibility. Another, is a measure based on SQLite’s concat_group aggregate function to provide a drill-down facility, e.g.
…where “ROWIDList” would be setup as concat_group(rowid,’,’) and will return a comma separated list of the underlying fact table ROWIDs.
A major reason for rolling my own pivot engine was to add a concept of “namespaces” and to separate the implementation of these namespaces from the actual pivot. When a tag (or a predefined hierarchy of tags) is assigned to a cube, it’s also assigned to a namespace, in many cases namespace and cube would be synonymous, but in some cases a more sophisticated approach is required:
- Multiple cubes sharing the same set of conformed dimensions would be best served by such cubes sharing a common namespace, and so they can.
- Different consumers of the pivot may require the use of a different language, be that a spoken language or a different ‘business language’ e.g. Manufacturing Product Codes V Consumer Product Names. Again, easily done.
- Sometimes identifying data can’t be shared with the datasmith or the numerical analyst working on a problem; in such cases being able to replace the actual namespace with an obfuscated one can be very useful. Or, for added security, the namespace might only be issued to approved PCs while the tag index and fact table are stored on a shared drive. Needs some more work to make managing such scenarios secure and easy to use but the structure is there.
As hinted on above, the three elements of a Tag Cube, the namespace, tag index and fact table can be assigned to different databases (i.e. files). Due to the wonders of SQLite’s ATTACH statement and the backup API’s ability to quickly load/unload databases in/out of memory, it’s possible, for example, to load namespace and tag index (i.e. the ‘dimensions’) into a memory database, while a very large (i.e. too big to fit to memory) fact table remains on disk. Fast and cheap SSDs will add further configuration options.
Although most of the TAG Cube functionality is available only within Excel, I’ve built a C based SQLite Virtual Table (cFact) to allow the tag index to used outside xLite. This means that SQLite drivers for ODBC (for use as a Pivot Table source, for example) or JDBC (for use in SQLPower Wabit perhaps) can efficiently access data models built using xLite.
I had to revert to using C rather than my preferred Python (did I mention that xLite now embeds Python in Excel, no, well it does, Python the newVBA ?), having failed to get around multi-threading issues with callbacks to Python in both the ODBC and JDBC drivers. I’d make a career promise to myself many years ago, not to having anything to do with printers or threads, and I think I’ll stick with it 🙂
TAG Cubes are the latest addition (still WIP actually) to be added to xLite, adding to:
- VBA coded SQLite SQL functions.
- Worksheet Functions; call out to a ‘function’ built using Excel formula, passing a parameter list and returning a value.
- Workbook Functions; like Worksheet Functions, but loading a new Workbook, passing in parameters, passing back a value (or tables) and closing the Workbook when finished.
- XLiteScript; xLite exposes its functionality via VBA coded UDFs, which can be called like any other formula, but data prep activities often require sequential procedural logic, xLiteScript is a table-oriented scripting mechanism offering basic flow-control logic.
- pyScript; I embedded Python into xLite to take advantage of Python’s speed in developing Virtual Tables, SQL Functions and extensions to SQLite and to tap in the wonderful world of Python code. I’ve also added the ability to use Python from scripts defined within Excel (to indent, tab to the next cell!).
- Fast load/unload to/from CSV.
- Load from any ADO source.
- Remove xLite formulae and rename and save Workbook, very handy when used via Workbook Functions to mass produce Excel “reports”.
- Other WIP items are; load from SAP, load/unload to/from Amazon S3, use Palo cubes as TAG Cube “facts”, slot in/out Palo for TAG Cubes, auto-generate Mondrian XML based on TAG Cubes, write-back and splash, Python & VBA TAG Cube “rules”.
Why not join me on Twitter at gobansaor?