In my previous post I looked at simulating a bitmap join in SQLite using a sub-query and the INTERSECT command. The problem, of course, is that this is only a simulation: SQLite lacks bitmap indexes, and although the sub-query will read only the fact table’s index B-trees (avoiding access to the fact table proper) and should be relatively fast, it will not match the efficiency of a true bitmap-based query. Also, the INTERSECT operation requires the creation of a temporary table, which could be very large; admittedly, a bitmap approach would also need temporary storage, but the space requirement would be much smaller, and databases such as Oracle 10g have highly efficient bitmap set-processing algorithms.
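For readers who missed that post, the simulated bitmap join can be sketched with Python’s built-in sqlite3 module. The fact table, index names and data below are purely illustrative, not taken from a real schema: each sub-select reads only an index B-tree and returns rowids, and INTERSECT combines the rowid sets.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact (dim_date_id TEXT, dim_store_id TEXT, sales REAL);
CREATE INDEX ix_fact_date  ON fact (dim_date_id);
CREATE INDEX ix_fact_store ON fact (dim_store_id);
INSERT INTO fact VALUES ('20070801','S1',10.0),
                        ('20070801','S2',20.0),
                        ('20070815','S1',30.0),
                        ('20070901','S1',40.0);
""")

# Simulated bitmap join: each inner SELECT is satisfied from an index
# B-tree alone; INTERSECT materialises the common rowids in a temporary
# b-tree, and only those rows are then fetched from the fact table.
rows = con.execute("""
SELECT sales FROM fact WHERE rowid IN (
    SELECT rowid FROM fact
     WHERE dim_date_id BETWEEN '20070801' AND '20070831'
    INTERSECT
    SELECT rowid FROM fact WHERE dim_store_id = 'S1'
)
""").fetchall()
print(rows)  # [(10.0,), (30.0,)]
```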
Another approach is to apply the method used by Oracle 7/8, i.e. join all the dimension tables together in a ‘Cartesian product’ join and then access the fact table using a multi-column index. This can be very efficient when the number of dimensions is low and the dimension tables are small. It could easily be implemented using a variation on my VBA-based xLITE utility.
Select week_no, day, dim_date_id from time_dim where dim_date_id >= '20070801' and dim_date_id < '20070901'
Or, first load the dimensions into Excel tables, allow the user to filter the resulting lists, and then load those filtered lists into :memory: tables.
Cartesian join the resulting :memory: tables and then join that dataset to the ATTACHed fact table, making sure to include each dimension’s primary key in the join.
Load the resulting dataset back into Excel and pivot.
The Cartesian join could result in a very large temporary table, and although it would contain fewer records than the equivalent ‘bitmap simulation’ table, each record would be much larger (“primary keys plus selected dimensional attribute columns” vs. “just rowids”).
Both methods would be sub-optimal where the query required reading a large percentage of the fact table. What that percentage is varies, but it could be as low as 10%.
So, which method to use? Maybe a combination of all three, in effect writing my own SQLite cost-based optimizer! Or, more likely, manually slotting in the required access method depending on the nature of the star schema. I’ve got to keep reminding myself that this is micro-BI: the data volumes will be in the megabyte or gigabyte range, not the terabyte range, with the resulting dataset being in a format suitable for analysing via an Excel pivot table. If a pivot table looks unlikely to be capable of handling the problem (e.g. ragged hierarchies, a budgeting read/write/splash requirement, or very large datasets), my next port of call would be a Palo cube; in that case the SQLite star schema would simply act as the ETL staging area for subsequent loading into the MOLAP cube.