Palo OLAP and sparse dimensions.

Last week I tried out both the latest Palo 2.5 release and its sister product, ETL-Server. Although I’ve not done any proper benchmarks, 2.5 does appear to be faster than the previous release, and the Excel add-in also behaves better when co-habiting with other add-ins and macros (the previous release’s use of, and response to, application-level events meant it often caused the entire Excel session to grind to a halt when other macros were processing data).

There is, however, a major memory-leak problem when using cubes with sparse dimensions (such as the Biker database’s Invoice dimension); at one stage my Excel session had a working set size of 750MB!

The failure of MOLAP cubes to handle sparse datasets effectively has always been something that ROLAP star schema advocates (myself included) have pointed to as a major shortcoming. Over time, products such as Essbase have managed to get around this limitation (but often only with careful up-front cube design by skilled staff). Palo hasn’t quite made it to that level yet. But then, maybe it shouldn’t; perhaps simplicity of setup and use should be Palo’s goal, not the ability to handle a telco’s ‘grain-at-the-level-of-call’ fact table.

Two of the new features in Palo 2.5 that I’d been looking forward to, Zero Suppression and Drill-through, are both relevant to the handling of cube sparsity.

  • Drill-through: reduces the sparseness of a dimension by letting you move the ‘grain’ up a level in the consolidation hierarchy, e.g. having a base level of Invoice Type instead of Invoice Number, or indeed dropping the dimension entirely.
  • Zero Suppress: filters out excess elements, e.g. showing only the Invoice Numbers raised this month rather than every invoice number ever raised, relying on a non-zero value to indicate that an invoice belongs to this month.

It was my testing of Zero Suppression using the Biker database that appears to have caused Excel to go on a memory binge. It doesn’t worry me that much, as the right design can reduce the need for zero suppression, e.g. in a customer dimension, ‘hide’ customers who trade with you infrequently under an “Others” consolidation. Also, the use of drill-through would eliminate the need for many sparse dimensions, but, alas, the drill-through functionality is severely nobbled in Palo 2.5.

Firstly, the functionality is only available in Excel if you purchase the €8,000 Palo Supervision Server, which is all fine and dandy if you have a need for all the good things that Supervision Server offers. Having said that, I did manage to bypass this requirement by calling the ETL-Server’s SOAP API directly from Excel with the help of Simon Fell’s superb PocketSoap library, so all is not lost on that front. But …

… the other reason I’m not overly impressed with the feature is its implementation, particularly for something Jedox is asking you to pay €8,000 for! Here’s what you get.

Within the ETL-Server’s cube ‘export’ you can specify “<info>” fields, such as invoice number, that will not form part of the final cube coordinate list. These coordinates and the info fields, along with the ‘value’ field, are then written to a Derby (aka JavaDB) database table, where the schema is the ETL project name and the table name matches that of the cube. So you end up with, in essence, a ‘fact table’ at a finer ‘grain’ than the cube, with the “info” fields playing the role of degenerate dimensions. The resulting table is not indexed, so large datasets will be a problem, and for it to work a cube can only be ‘owned’ by a single ETL project.
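To make that structure a bit more concrete, here is a rough sketch of how you might query the resulting Derby table over JDBC. Be warned that every specific in it is an assumption on my part (the connection URL, the use of the Derby network server, the column names); the only things taken from the export behaviour described above are the schema-equals-project and table-equals-cube conventions, and the fact that the info fields come back alongside the coordinates and the value.

    import java.sql.*;

    public class DrillThroughQuery {
        public static void main(String[] args) throws Exception {
            // Assumptions: derbyclient.jar is on the classpath, the ETL-Server's
            // Derby database is reachable through the Derby network server on the
            // default port, the database is called "etlserver", the ETL project
            // (schema) is "Biker" and the cube (table) is "Orders"; the column
            // names below are equally illustrative.
            String url = "jdbc:derby://localhost:1527/etlserver";
            try (Connection con = DriverManager.getConnection(url);
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT * FROM Biker.Orders WHERE Products = ? AND Months = ?")) {
                ps.setString(1, "Mountain Bike");  // the cube coordinates being drilled from
                ps.setString(2, "Jan");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // the "info" fields (e.g. the invoice number) come back
                        // alongside the coordinate columns and the value column
                        System.out.println(rs.getString("InvoiceNumber")
                                + " -> " + rs.getDouble("Value"));
                    }
                }
            }
        }
    }

The point is simply that what comes back is one row per underlying record rather than one per cube cell, i.e. the finer ‘grain’ mentioned above.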

The other thing to note is that drill-through only works for ‘N’ coordinates, i.e. no consolidated elements. This is unlike Excel’s Pivot Table, which allows you to double-click on any cell to reveal the underlying dataset.

On the plus side, the new ETL-Server is actually very good and well put together. You’re better off not thinking of it as a generic ETL tool but as a specialist Palo loader, like Oracle’s SQL*Loader. And, like SQL*Loader on many Oracle projects, for a large percentage of Palo projects ETL-Server will be the only ETL tool needed.


9 responses to “Palo OLAP and sparse dimensions.”

  1. Hi!
    We will address this memory consumption issue in our next release (coming out on June 25th). The problem lies in the way, and the amount of, data we transfer between our core client library (libpalo_ng) and the .NET Excel frontend. There are some issues with the garbage collector in .NET that we try to work around.

    Vlado,
    Palo Team

  2. Andreas Fröhlich

    Hello,
    concerning the Drill-through, take a look at this topic on the Palo-Forum:
    http://forum.palo.net/thread.php?threadid=1046

    Andreas,
    Palo Team

  3. @Vladislav

    Excellent news, looking forward to June 25th.

    @Andreas

    I’ve seen it, I’m the “gleesoto” who raised the question 😉

    I still think you are going to have to allow for drill-through from consolidated coordinates; forcing users (who will undoubtedly be comparing the experience to Excel Pivot tables) to drill down to the base level of every single dimension is (a) unnecessary and (b) a huge inconvenience.

    Yes, such a drill-through might occasionally return a result set of hundreds of thousands of rows, but that could be protected against. The argument that the required indexes on the drill-through “fact table” would slow down the load process could be addressed by using a faster database (SQLite!!!) and/or by using an intermediate temp table during the load (a good idea anyway, as it allows the ETL process to run while drill-through functionality continues), then transferring the data to a “latest version” table, batch-building the indexes and finally pointing users at the latest version. (This would, of course, also require Palo database versioning.)
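    Roughly the kind of thing I have in mind is sketched below in Java/JDBC terms; it’s a back-of-the-envelope sketch only, and the schema, table and index names are all made up:

    import java.sql.*;

    // Sketch of the "temp table, then publish" pattern described above.
    // Schema, table and index names are purely illustrative.
    public class DrillThroughRefresh {
        public static void refresh(Connection con) throws SQLException {
            try (Statement st = con.createStatement()) {
                // 1. The ETL job has already bulk-loaded the new export into the
                //    unindexed BIKER.ORDERS_TEMP, while users carry on drilling
                //    through against the currently published table.

                // 2. Transfer to a fresh "latest version" table and batch-build
                //    its indexes, without disturbing the published table.
                st.execute("INSERT INTO BIKER.ORDERS_LATEST SELECT * FROM BIKER.ORDERS_TEMP");
                st.execute("CREATE INDEX ORDERS_LATEST_IDX ON BIKER.ORDERS_LATEST (PRODUCTS, MONTHS)");

                // 3. Finally, switch the drill-through queries (and, ideally, a
                //    matching Palo database version) over to BIKER.ORDERS_LATEST.
            }
        }
    }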

    Nevertheless, as I’ve said before, the ETL Server is impressive. Keep up the good work.

    Tom

  4. Hi, interesting, as sparsity handling is key to OLAP models – it’s what makes Cognos’s TM1 so good.

    In my case, what I’m evaluating in Palo is the handling of sparse data and rules/calcs.

    I’ve added to ‘Biker’ a duplicate of the orders cube that multiplies the Units measure (at N:) by a notional currency value – simulating a currency translation. This -could- result in a mass calculation of every cell in the cube unless sparsity is handled and only those cells with data in them are calculated.

    I’m waiting for the results as I read this, but it’s looking slow so far…

  5. @Jason

    Make sure you use “markers”, otherwise it will result in a mass calculation.

    i.e.

    rather than

    [‘EuroValue’] = N:[‘Units’] * [‘EuroRate’]

    use:

    [‘EuroValue’] = N:[[‘Units’]] * [‘EuroRate’]

    This will only calculate where Units are set.

    Tom

  6. @Tom, thanks for that. As it happens I found that bit out – I like it, simple.
    The next question was how to do this from another cube – where I’m using a PALO.DATA() reference that needs to be a ‘marker’.

    i.e. [‘EuroValue’] = N: PALO.DATA (“Cube1”..) * PALO.DATA (“Currency”..)

    It’s possible that this isn’t the best way to build in Palo, but I’m comparing to TM1 where ‘feeders’ are often pushed from a source cube to a calculation cube.

  7. @Jason

    For that you’ll need the PALO.MARKER function

    [‘EuroValue’] = N: PALO.MARKER (“Cube1”..) * PALO.DATA (“Currency”..)

    ..you can also mark PALO.DATA expressions by appending @PALO.MARKER. You’ll need this if source dimensions require functional modification (like SUBSTR for example) as MARKER parameters can only be static constants or variables …

    [‘EuroValue’] = N: PALO.DATA (“Cube1”..)@PALO.MARKER (“Cube1”..) * PALO.DATA (“Currency”..)

    Tom

  8. @Tom.

    Thanks again, I’ll do some more playing.
    Last night I rebuilt Biker in TM1 9.4 (for interest, 81,092,880,394,240 cells with <150,000 data points – nice and sparse) for a quick sparsity test.

    A simple currency calc (as above) would easily kill an OLAP tool with poor sparsity handling, but Palo and TM1 performed equally well – sub-second response for a top-level roll-up of the calculated member.

    Next I’ll extend the TM1 model to try and get the response times to something I can actually measure and try the same with Palo – suggestions welcome!

    Jason.

  9. @Jason

    Excellent.

    When I first came across Palo, I was quite happy to see it as a “good enough” alternative to the established OLAP tools. Sort of like the dog who can ride a bike: the point was not so much how well the dog could ride, but that the dog could ride at all.

    No more. Palo is now a serious contender and, from what I understand of future plans for the product, it is going to get even more capable.

    We just need more people out there evangelising it; it’s undoubtedly the best-kept secret in BI today.

    Tom