In a previous post I put forward an approach to using SQLite as the basis for a Tag Cube i.e an alternative pivoting mechanism using a ‘tag’ index to enable multi-column star-queries against a central fact table. Although the appearance on the scene of PowerPivot has negated the raison d’être for “Tag Cubes”, the concept of a “Tag Index” still has relevance.
A tag index differs from a normal b-tree index in that it is optimised for multi-columnar access (OLAP type queries) rather than single-column primary/foreign key access (OLTP type queries). In essence it’s somewhat similar to columnar indexes over relational databases such as will be available in the next version of SQLServer (see here, for this and other exciting news re the future of PowerPivot in the enterprise).
SQLite doesn’t offer columnar indexes but it does have something similar; FTS3 – Full Text Search virtual tables; and this is what I use to implement TAG indexes.
The basic idea is to create an FTS3 virtual table that indexes the primary search columns of a fact table (or very large dimensional table). This virtual table is then used as a first level filter ‘index’ for the fact table. To make this simpler to use (for example, to make it possible to be used as a fact table in a SQLPower Wabit Mondrian schema) I wrap the fact table and FTS3 ‘index’ table using a virtual table of my own. An example will best explain it:
If you have a fact table such as this:
CREATE TABLE orderFacts (CustomerID, ProductID, CustomerType, InvoiceDate, Qty,Value);
To create a Tag index:
CREATE VIRTUAL TABLE orderFactsFTS using FTS3();
Populate this ‘index’ with the rowid and a space delimited concatenation of the relevant ‘key’ columns (docid & content are the default columns of an FTS3 virtual table):
INSERT into orderFactsFTS (docid,content) select rowid,CustomerID||’ ‘||ProductID||’ ‘||InvoiceDate from orderFacts;
Finally, create a ‘wrapping’ virtual table using my TAGindex virtual table function:
CREATE VIRTUAL TABLE orderFactsTAGs using TAGindex (orderFacts, orderFactsFTS,3,CustomerID,ProductID,InvoiceDate, CustomerType,Qty,Value);
The TAGIndex parameters are:
- The fact table name.
- The FTS3 ‘index’
- The number of columns that the TAGIndex virtual table will use as part of the tag index. In this case 3, as we’ve didn’t index customerType (too low a cardinality to bother, perhaps) nor the two ‘measure’ columns of Qty and Value.
- The column list from the fact table to expose via the TAGIndex virtual table. Columns that are part of the index must appear at the beginning of the list and must match in number the value specified by the 3rd argument above.
We can then access the resulting table like so:
The same result could be achieved without the TAGindex virtual table using :
There are a number of potential problems with this approach:
- What happens if there’s a product and a customer with the same 2098 key? The FTS3 filter will be less effective but the result will still be correct due to the ‘filter twice’ provided by the standard SQL WHERE predicates. Obviously it would be better in that case to add new columns to the table to ensure uniqueness of keys across key domains but it’s not essential.
- If a key value begins with a “-” the FTS3 engine will treat that as “not =” resulting in an incorrect result. Unlikely, but if likely, same solution as above, new column to doctor the key to suit.
- High cardinality amongst key values, e.g. Invoice number, causing a very large, potentially slow, index. This is the same problem that columnar-store indexes face. Best not index such columns if index bloat occurs but it most cases (e.g. such columns being a minority) this shouldn’t be a problem.
So what are the potential use cases for TAGindex tables?
- Provide fact table indexing for classic PivotTable “cubes” data backed by microETL add-in.
- Ditto for SQLPower SQLite-backed Wabit cubes.
- Ditto for SQLite3.exe reporting.
- Allow for SQL powered discovery/validation of very large datasets whether using microETL or simply SQLite3.exe at the command line. Even if using PowerPivot, in can be useful to validate results using an alternate approach. Also, using “group_concat” aggreagate function, it’s possible to drill-thru to detail level, something not possible using PowerPivot.
Here’s the code of the TAGindex virtual table function (implemented as a SQLite extension):
And, here’s a already built dll (in SQLite3.exe issue “.load TAGindex.dll” command before using):
For another example of a SQLite virtual table (this time combining Expat XML streaming parser with SQLite) see http://blog.gobansaor.com/2010/06/16/sqlite-xml-streaming-virtual-table-via-expat/