SQLite as the MP3 of data

… and Excel as its “mixing desk”.

When I tell people that I use SQLite in combination with Excel (via microETL and now via my latest offering The Datasmith’s HAMMER) as my datasmithing platform, many ask why SQLite? (Many others ask why Excel?  but “sin scéal eile”, that’s another discussion – Excel as the iPod of Downloaded Data.) Those that question my use of SQLite tend to cluster into four camps:

  • Pure Excel jocks.
  • MS Access fans.
  • The client server database brigade (SQL Server,Oracle; or if FOSS fans; MySQL, PostrgeSQL).
  • The MOLAP folks (Essbase, TM1, Palo).

Now while I have used and will continue to use/encounter all four ‘approaches’, I’ve come to believe over the last couple of years that SQLite brings something special to the datasmithing game. When I look back over nearly 30 years in the data handling business I keep thinking – “If only I had SQLite then, how much easier/quicker/cheaper that task would have been!”.

Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.

As operational local caches, this use of SQLite is already far advanced. SQLite is embedded in lots of every day software tools, everything from McAfee anti-virus to TweetDeck Twitter clients (best one IMHO). But my interest is more in SQLite’s potential as a micro-BI (or maybe more correctly a distributed-BI) platform. A sort of MP3 format for distributed structured data, if you like.

But why SQLite (and in particular SQLite in combination with Excel) as my datasmithing tool rather than the four other approaches?  First, what’s a datasmith?

Managing and manipulating datasets has become an integral part of many people’s job, not just accountants (the original of the species) but marketing executives, sales staff, pricing analysts, process engineers; different job titles, different roles but using a skill that they’ve likely never been formally trained in, a skill without a name; a skill I call datasmithing. I like to think of  myself as a master datasmith, or a datamith’s datasmith.

If you consider yourself a datasmith then most likely the tool you use to manage your datasets is Excel. And before you apologise, don’t. Excel is by far the best and most flexible end-user data manipulation tool out there. Everything from the current financial crisis downwards has at some stage being blamed on Excel, but you know and I know that many tasks would remain undone or under-done were it not for end-user generated spreadsheets.

Spreadsheets are not however optimal for some tasks, linked spreadsheets in particular are data disasters in waiting. While fantastic for data transformations and presentations, as books-of-record they’re rarely suitable. Other tools such as SQL based relational databases and in-memory OLAP offer much better and potentially much more cost-effective data modelling functionality, but also at a cost of extra complexity and ongoing technical support.

MS Access (which like SQLite, is a document-centric, non-client-server database; but unlike it, is also a forms/reporting development environment) would appear to be the natural local store database. My problem with MS Access has been its tendency to try to be all things to all men, ending up not fully satisfying anybody. Professional developers think it’s too limiting, non-techs find it too intimidating, even reporting, where it once showed promise left a big enough opening for Crystal Reports to evolve. It is also limited to Windows which might not seem to be a problem if combining with Excel, but, as it’s often necessary, due to scale or complexity of the data,  to use ‘proper’ ETL tools such as Talend, having an OS agnostic database format than can act as a distribution media (think MP3s again) between “mixing desks” can be very useful.

The big difference to MS Access for me is SQLite’s open source code; code that’s a pleasure to browse and with an approachable API that even I, with my very rusty C skills, can manipulate. Having access to that code allows me to tightly integrate it with Excel, so much so, that I can use Excel functions (built-in functions, VBA user-defined functions and 3rd party add-in functions) directly from SQLite’s SQL; and vice-versa, access SQL functionality via Excel “formula” calls. It is  also possible to  load most datasets into memory using SQLite’s in-memory mode enabling very fast processing  and near zero-latency when passing data to and from Excel/VBA. In the near future, cheap, large SSDs will enable non-memory databases to offer similar speed but also handle extremely large datasets (see this for a glimpse of that future).

What about the big beasts of the data world, the client-server databases? Having spent most of my professional life working with such tools I’m aware of the power of a well designed relational database. If SQLite is the MP3, then these are the master tapes, the DDD recordings. Most of the data that eventual ends up in SQLite for analysis and/or transformation will have originated in data-warehouses or be directly sourced  from OLTP systems built using relational technology. But for close-up analysis and transformation, the pure simplicity and convenience of SQLite is hard to beat. That simplicity is primarily due to its Excel-like ‘document’ nature, all code and data can be housed in a single folder (or true-crypt container for added security), ensuring that the ‘problem domain’ can be easily archived and/or shared with others without the need for professional IT resources.

And yes, I hear you, isn’t that the basis of Excel-hell? Yes it is, but over the years I’ve found that this is rarely a problem for datasmiths, they deal day-in day-out with document work-flows, they understand the risks and the benefits (mainly the simplicity) of the approach. Where the nightmare truly happens is when this approach is used as an alternative to an OLTP system i.e. using Excel and other document-like datastores as books-of-record in large multi-user environments – “there be monsters for sure”.

How about MOLAP? Wasn’t Essbase’s name derived from “extended spreadsheet database” and doesn’t Palo offer a truly excel-friendly multi-user database back-end? Again, having worked with Essbase for many years and now being a big fan of the open source Palo MOLAP tool, I fully appreciate the power that such tools brings to analysis and multi-user planning tasks. But for many situations, an Excel Pivot Table is “good enough” and even when it’s not, it is possible by utilising what I call a tOLAP cube (essentially, a fact table indexed via tags enabled by Google’s great addition to SQLite, the FTS3 virtual table) to build and access  powerful, yet simple, cube-like data structures  [update: Oct 2010] there’s now the option of using Excel 2010’s magnificent PowerPivot add-in.

By integrating SQLite with Excel, datasmiths can have the best of both worlds, familiar spreadsheet front-end combined with a fast and powerful SQL engine and datastore, in fact, everything that MS Access should have been.

Why not join me on Twitter at gobansaor?

Advertisements

23 responses to “SQLite as the MP3 of data

  1. Pingback: BotchagalupeMarks for March 15th - 12:57 | IT Management and Cloud Blog

  2. Nice post mate.

    I was thinking along the same lines excel with SQLite for a little project I am doing, nice to get some professional/peer support.
    Anything like MySQL even is overkill for this.

    Have Fun

    • Hi Paul,

      Glad to be able to offer moral support 🙂

      My 1st instinct is now to default to SQLite unless there’s a very good reason not to. Mind you, in the last few days I’ve had a demo of Project Gemini a free add-on to the next version of Excel (late 2010 by all accounts). Described as XXL, it’s essential a Pivot Table on steroids, the demo I saw was easily handling 20m rows on 4GB PC. Powerful stuff, and very end-user friendly. When it appears and if it delivers on its promise I’ll be using SQLite less, but will still use SQLite as an ETL and data cleansing tool and of course for situations where older Excel versions are still in use (usually a large majority of sites, but maybe less in future as Gemini is the first ‘improvement’ I’ve seen in a long time that would convince me to quickly upgrade to latest version of Excel).

      Tom

  3. Pingback: Project Gemini - XXL, Excel on Steroids « Gobán Saor

  4. Very interesting, including following the links.

    Back around the turn of the century, I spent a couple of years pushing cost accounting data around in Excel and Access with the added benefit of some Perl scripts and other stuff. I can see where SQLite would be a very interesting alternative to Access within that mix.

    These days, I’m on to other things, but a friend has moved into auditing and might find this information intriguing; whether they have to chops to develop upon it themselves is an open question, but their work might benefit from such a toolset. Passing the link on to him.

  5. Sweet post. I am also a data dude and have become enamored of SQLite since making acquaintance with it over the past year in my Mac and Python travels. There are definitely situations where you don’t need the full blown database server (even MySQL as mentioned above), and Access is definitely not the answer.

    Hadn’t thought of the MP3 analogy, though…

    • @SDC

      I picked on the MP3 analogy to help others see SQLite not simply in comparison to MySQL, or Oracle etc., as a poor man’s RDBMS, but as powerful, yet simple, tool in its own right.

      Tom

  6. @PB

    Glad you found it interesting, SQLite is used by many simply from the command line, all you need is a bit of SQL knowledge and the know how to load and unload CSV files etc.

    Linking it more tightly into Excel gives it some added power and one of these days I’ll get around to publishing how to do it.

    Tom

  7. Not that I would have you change or methods, but have you had a look at SQL Server CE? It’s something I have been considering, but haven’t had a chance to play with.

    It is not cross platform like SQLite, but from what I understand it is quite similar. I would be interested in your feedback.


    E

  8. @Eric

    Yes CE is a fine product, but, and this is a big but :), CE ain’t open source. I love how I can spend a few spare hours on a weekend to investigate and extend SQLite (new virtual tables, new functions, modify SQLite3.exe to fit in with a Hadoop streaming scenario, whatever), it’s not the free that matters it’s the open source. A bit like democracy, it too isn’t free (taxes and all that) but its open and fair.

    Tom

  9. I do consulting as a BI/data analyst outside of regular work hours and Tableau has been a wonderful tool for me to show to clients. The purchase price is a great barrier to entry for most hobbyists though. It’s purely a tool for analysis and not storage so you’ll still need 3rd party data sources such as excel or sql server to base reports and dashboards on.

    Reports that would take me hours to do through sql server reporting services would take minutes with this tool and the interfaces for drilling down, filtering, highlighting, etc. are such a pleasure to use that I dread going back to the 9-5’er and using the tools there.

  10. @A Malong

    Yes, Tableau is a fantastic product, I used it briefly a few years ago and was very impressed. But as you say the cost is at a level that would either require IT/Finance approval in large companies or hit the “Well, Seán in accounts can produce pretty reports with Excel, why do you need to spend so much money on this..” barrier in smaller organisations.

    Tom

  11. ————–
    Linking it more tightly into Excel gives it some added power and one of these days I’ll get around to publishing how to do it.
    —————

    I, for one, will be waiting eagerly for these posts. Thanks for a great post.

  12. Pingback: Voice of the PPL » Pre-Order PPL 2 and save 10%

  13. Pingback: SQLite in Ubuntu 9.04 « Number Cruncher

  14. Pingback: TAG Cubes – SQLite Star Query Part III « Gobán Saor

  15. Pingback: What’s so good about SQLite? « Gobán Saor

  16. Pingback: Excel as the iPod of Downloaded Data « Gobán Saor

  17. Pingback: JavaScript as an Excel scripting language via JSDB « Gobán Saor

  18. Pingback: Excel in Data | Gobán Saor

  19. Pingback: microETL – the ultimate datasmithing tool? | Gobán Saor

  20. Pingback: Accessing SQLite databases from Excel via microETL | Gobán Saor

  21. Pingback: Those with a datasmith’s hammer, see every problem as a table. | Gobán Saor