… and Excel as its “mixing desk”.
When I tell people that I use SQLite in combination with Excel (via microETL and now via my latest offering The Datasmith’s HAMMER) as my datasmithing platform, many ask why SQLite? (Many others ask why Excel? but “sin scéal eile”, that’s another discussion – Excel as the iPod of Downloaded Data.) Those who question my use of SQLite tend to cluster into four camps:
- Pure Excel jocks.
- MS Access fans.
- The client-server database brigade (SQL Server, Oracle; or, for FOSS fans, MySQL, PostgreSQL).
- The MOLAP folks (Essbase, TM1, Palo).
Now, while I have used and will continue to use/encounter all four ‘approaches’, I’ve come to believe over the last couple of years that SQLite brings something special to the datasmithing game. When I look back over nearly 30 years in the data handling business I keep thinking – “If only I’d had SQLite then, how much easier/quicker/cheaper that task would have been!”.
Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.
As an operational local cache, this use of SQLite is already far advanced. SQLite is embedded in lots of everyday software tools, everything from McAfee anti-virus to Twitter clients such as TweetDeck (the best one, IMHO). But my interest is more in SQLite’s potential as a micro-BI (or maybe more correctly a distributed-BI) platform. A sort of MP3 format for distributed structured data, if you like.
But why SQLite (and in particular SQLite in combination with Excel) as my datasmithing tool rather than the four other approaches? First, what’s a datasmith?
Managing and manipulating datasets has become an integral part of many people’s jobs, not just accountants (the original of the species) but marketing executives, sales staff, pricing analysts, process engineers; different job titles, different roles, but all using a skill that they’ve likely never been formally trained in, a skill without a name; a skill I call datasmithing. I like to think of myself as a master datasmith, or a datasmith’s datasmith.
If you consider yourself a datasmith then most likely the tool you use to manage your datasets is Excel. And before you apologise, don’t. Excel is by far the best and most flexible end-user data manipulation tool out there. Everything from the current financial crisis downwards has at some stage been blamed on Excel, but you know and I know that many tasks would remain undone or under-done were it not for end-user generated spreadsheets.
Spreadsheets are not, however, optimal for every task; linked spreadsheets in particular are data disasters in waiting. While fantastic for data transformations and presentations, as books-of-record they’re rarely suitable. Other tools such as SQL-based relational databases and in-memory OLAP offer much better and potentially much more cost-effective data modelling functionality, but at the cost of extra complexity and ongoing technical support.
MS Access (which, like SQLite, is a document-centric, non-client-server database; but unlike it, is also a forms/reporting development environment) would appear to be the natural local store database. My problem with MS Access has been its tendency to try to be all things to all men, ending up fully satisfying nobody. Professional developers think it’s too limiting, non-techs find it too intimidating, and even in reporting, where it once showed promise, it left a big enough opening for Crystal Reports to evolve. It is also limited to Windows, which might not seem to be a problem if combining with Excel, but, as it’s often necessary, due to the scale or complexity of the data, to use ‘proper’ ETL tools such as Talend, having an OS-agnostic database format that can act as a distribution medium (think MP3s again) between “mixing desks” can be very useful.
The big difference from MS Access, for me, is SQLite’s open source code; code that’s a pleasure to browse, with an approachable API that even I, with my very rusty C skills, can manipulate. Having access to that code allows me to tightly integrate it with Excel, so much so that I can use Excel functions (built-in functions, VBA user-defined functions and 3rd party add-in functions) directly from SQLite’s SQL; and vice-versa, access SQL functionality via Excel “formula” calls. It is also possible to load most datasets into memory using SQLite’s in-memory mode, enabling very fast processing and near-zero latency when passing data to and from Excel/VBA. In the near future, cheap, large SSDs will enable on-disk databases to offer similar speed while also handling extremely large datasets (see this for a glimpse of that future).
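As a rough illustration of those two ideas – the in-memory mode, and registering a host-language function so it becomes callable from SQL (the actual microETL bridge goes through SQLite’s C API and VBA; Python’s sqlite3 module is used here purely as a self-contained stand-in, and the `pct_change` function is my own invention for the example):

```python
import sqlite3

# ":memory:" keeps the whole database in RAM -- SQLite's in-memory mode.
con = sqlite3.connect(":memory:")

# Register a host-language function into the SQL engine; this is the same
# mechanism that lets an Excel/VBA function be invoked from SQLite's SQL.
def pct_change(new, old):
    return None if old == 0 else (new - old) / old

con.create_function("pct_change", 2, pct_change)

con.execute("CREATE TABLE sales (period TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("2009", 100.0), ("2010", 130.0)])

# Call the registered function from plain SQL.
row = con.execute("""
    SELECT pct_change(
        (SELECT amount FROM sales WHERE period = '2010'),
        (SELECT amount FROM sales WHERE period = '2009'))
""").fetchone()
# row[0] -> 0.3, i.e. a 30% year-on-year increase
```

The same pattern works in the other direction too: once the engine is embedded in-process, result sets come back with effectively no marshalling overhead, which is what makes the near-zero latency claim above plausible.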
What about the big beasts of the data world, the client-server databases? Having spent most of my professional life working with such tools, I’m aware of the power of a well-designed relational database. If SQLite is the MP3, then these are the master tapes, the DDD recordings. Most of the data that eventually ends up in SQLite for analysis and/or transformation will have originated in data warehouses or be directly sourced from OLTP systems built using relational technology. But for close-up analysis and transformation, the pure simplicity and convenience of SQLite is hard to beat. That simplicity is primarily due to its Excel-like ‘document’ nature: all code and data can be housed in a single folder (or TrueCrypt container for added security), ensuring that the ‘problem domain’ can be easily archived and/or shared with others without the need for professional IT resources.
And yes, I hear you, isn’t that the basis of Excel-hell? Yes it is, but over the years I’ve found that this is rarely a problem for datasmiths; they deal day in, day out with document workflows, and they understand the risks and the benefits (mainly the simplicity) of the approach. Where the nightmare truly happens is when this approach is used as an alternative to an OLTP system, i.e. using Excel and other document-like datastores as books-of-record in large multi-user environments – “there be monsters for sure”.
How about MOLAP? Wasn’t Essbase’s name derived from “extended spreadsheet database”, and doesn’t Palo offer a truly Excel-friendly multi-user database back-end? Again, having worked with Essbase for many years and now being a big fan of the open source Palo MOLAP tool, I fully appreciate the power that such tools bring to analysis and multi-user planning tasks. But for many situations an Excel Pivot Table is “good enough”, and even when it’s not, it is possible, by utilising what I call a tOLAP cube (essentially, a fact table indexed via tags, enabled by Google’s great addition to SQLite, the FTS3 virtual table), to build and access powerful yet simple cube-like data structures. [Update, Oct 2010: there’s now also the option of using Excel 2010’s magnificent PowerPivot add-in.]
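A minimal sketch of the tOLAP idea, under my own toy schema: each fact row gets a bag of dimension tags, the tags go into a full-text virtual table, and a slice of the “cube” is just an aggregate over a MATCH query. (Most current SQLite builds expose the FTS3 module family under the `fts4` name, used here; again Python’s sqlite3 stands in for the Excel/VBA bridge.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, amount REAL);
    -- Tag index: one text column holding this fact's dimension members.
    CREATE VIRTUAL TABLE fact_tags USING fts4(tags);
""")

# Each fact is tagged with its year, region and product "coordinates".
rows = [
    (1, 100.0, "y2009 emea widgets"),
    (2, 250.0, "y2009 apac widgets"),
    (3,  75.0, "y2010 emea gadgets"),
]
for rid, amount, tags in rows:
    con.execute("INSERT INTO facts VALUES (?, ?)", (rid, amount))
    con.execute("INSERT INTO fact_tags(rowid, tags) VALUES (?, ?)", (rid, tags))

# Slice the cube: total amount for year 2009 in EMEA.
# FTS MATCH treats multiple terms as an implicit AND.
total = con.execute("""
    SELECT SUM(amount) FROM facts
    WHERE id IN (SELECT rowid FROM fact_tags WHERE tags MATCH ?)
""", ("y2009 emea",)).fetchone()[0]
# total -> 100.0 (only fact 1 carries both tags)
```

The appeal is that adding a new “dimension” is just a matter of inserting another tag; there’s no star schema to redesign, which suits the quick, disposable analyses datasmiths actually do.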
By integrating SQLite with Excel, datasmiths can have the best of both worlds: a familiar spreadsheet front-end combined with a fast and powerful SQL engine and datastore – in fact, everything that MS Access should have been.
Why not join me on Twitter at gobansaor?