microETL – the ultimate datasmithing tool?

Over the years I’ve accumulated and written thousands of lines of VBA code, I’ve also integrated Excel with countless other technologies. Most of this work was associated with the movement and managing of datasets; data for reporting on, for analysing and for collection. MicroETL is, in effect, a collection of the best (or at least the most useful) of these activities, packaged primarily so that I don’t have to continuously re-invent the wheel, so that when I’m working on a client’s problem, I can concentrate on the problem and not the technology. So although I share the code with all and sundry with the hope that others may find it useful (little choice as it’s VBA 😉 ), its primary purpose is to the serve my needs, and indirectly, those of my clients.

VBA is of course at the heart of microETL. I’m often asked why not port the code to one of Microsoft’s more “modern” languages? There are three  main reasons for not doing so now (and one major one for not doing so in the past).

First the past. When MS introduced .NET the needs of those who program mainly around the Office platform were ignored, .NET add-ins for Excel were a pain to write, even more painful to deploy and even then, were slow compared to VBA add-ins. Things have improved, it’s now much easier to target Office from within Visual Studio (although still just as expensive). But even now, the best method of using .NET with Excel is a  superb 3rd-party open source tool, ExcelDNA. This is the tool I’ve used (and will continue to use) whenever I needed to merge Excel with the .NET platform.

I’m currently developing a .NET offshoot of microETL called HAMMER; using ExcelDNA, C#-SQLite and IronPython; not so much a replacement for microETL but a new type of tool to take advantage of the new capabilities offered by Excel 2007/2010 (multi-threading and 64bit memory addressing mainly) and .NET4 (IronPython and ease of multi-threading programming). It’s also, thanks to ExceDNAPack, a single file deployment. To keep an eye on its development follow the HAMMER tag on my blog.

So if ExcelDNA provides a pain-free (and cost-free) method of using C# or VB.NET with Excel, why stick with VBA? As I said, three reasons:

  • VBA is to Excel what JavaScript is to the web browser. It’s only remaining purpose as a language (with the death of VB6) is as an Office automation language and ,in effect, when you’re programming in Excel you’re largely working with the Excel Object model (in much the same way and the DOM is the focus of attention on JavaScript progamming). VBA is perfectly suited to this task and although it has its problems as a language (again like JavaScript) once you know them they’re not a problem. VBA is here for the long haul, Office 2010’s addition of the 64bit support to the language (Vb7 !) confirmed that. As the code is already in VBA, works very well as VBA and has a long future ahead of it, why change?
  • VBA is easy for a large subset of end-users to learn due to Excel’s macro-recording facility. The resulting code may not be pretty but it works both as a automation tool and as a learning tool. Millions of citizen programmers owe their skills to the VBA macro recorder. One of microETL’s goal is to make adding functionality as easy as possible and one such method of doing that is via simple VBA functions that can be called from SQLScript (microETL’s tabular scripting extension). C# or VB.NET are professional languages designed for professionals to use, VBA comes from a long line of “pro-am” scripting languages which open the world of programming to a much larger population.
  • Multiple deployment options. MicroETL can be installed like a normal add-in and consists of three files, the VBA add-in, the SQLite3 dll and the xLiteSQLite dll to wrap SQLite to make it VBA accessible. Python functionality and other options will add further files. But by embedding a small call-on-open macro in a workbook the add-in can be called without installation. This is how I use it for once-off tasks, as I can package the code and the associated data and workbooks in a single zip file for archiving purposes or for deployment (often running off a USB stick).  Because I can deploy my code as code-at-a-moment-in-time I can modify the codebase to suit the current requirements without affecting other microETL projects that are using, or might in the future use, a different version fo the code. In fact, microETL has a noSQL compile option that removes SQLIte and other DLL dependencies (with a reduction in power obviously) enabling the code to be embedded in a workbook allowing for a single-file deployment (which often, is the only option). This could not be done with a .NET add-in.

Okay, so VBA is useful, but why embed SQLite and Python? As I’ve said above, microETL exists to make my life easier. I’m a database programmer, I’ve been using SQL (or its predecessors ) for  three decades. I like set-based logic (such as SQL  or PowerPivot’s DAX) because:

  • I’m good at it (practice does make perfect) enabling me to solve many data related problems efficiently and more accurately (fast is no good if the result is wrong) that would be the case with functional (Excel fromulas) or procedural (code) approaches.
  • I believe anybody who spends a significant proportion of their working lives managing datasets should learn basic SQL. If they don’t it’s like driving a car but never getting beyond 2nd gear.

SQLite allows me to bring the beautiful world of set logic to user-centric Excel world of functional programming (yes, all you Excel jocks, you’re functional programmers). It also brings some nice side-effects:

  • It is, like Excel itself, an outstanding single-file document-oriented datastore.
  • By default, microETL uses SQLite in-memory database functionality (you can also attach disk-based ones) so it offers a means of exploiting the vast amount of cheap RAM that modern PCs now offer. Excel has always stored its working datasets in-memory and it too now offers much greater capacity (1,000,000 rows per sheet in Excel 2007/2010). SQLite has no practical limit other than the available memory and the addressable-limit of the OS (2GB for 32bit Windows). With 64bit Windows (microETL is Excel 64bit enabled) that addressable limit is now effectively gone. This makes microETL a superb platform for all sorts of ETL tasks.
  • As SQLite offers Excel a secondary in-memory datastore it can be used to share models across programming platform boundaries. This allows me to embed Python in Excel and use the in-memory SQLite database as a shared data conduit. I’ve done the same with JavaScript (but only on 32bit) and with .NET. The result is the non-native languages can be used as if they were not operating with Excel, no need to mangle in Excel interface code, simple!

Which brings me to Python. Why embed a Python interpreter into microETL? Can I not just use VBA as I quite obviously know it inside out? Well of course for all things Excel-focused that is what I do, and for most other transformation or calculations that require some programming I would also opt for VBA. No reason not to, it works and works quickly. However, Python (and in particular C-Python) offers access to world of superb code. If something can be automated, or interfaced to, or turned into a computing algorithm, somebody (usually very bright) will have done it in Python. Python also works well with C, and again if a C library is useful somebody will have wrapped it in Python. Python means that if I’m asked “Can you integrate Excel with ……?”, I can say, “Yeah, no problem”.

Python as a language has the other advantage of being a superb tool to work through a problem with; it doesn’t get in the way; very little of my mind has to be diverted to the language itself, the problem at hand gets the attention (C would be an example of a language on the other extreme, you better have the solution to your domain problem figured out before you start to program in C as the programming task itself will command most of your attention).

Python is an ideal language for business logic. It is easy to learn, eminently readable and generally immune to programming gotchas. Because of this, it’s very popular as a citizen programmers’ language, especially amongst engineers, scientists and quants, so it blends in naturally with microETL the ultimate datasmithing tool!

So if microETL is built on a foundation of “civilian friendly” technologies such as Excel formulas, VBA, SQL and Python is it a tool that anybody could use? In theory yes, in practice it depends.

Only a small proportion of the population has the ability to master procedural programming and an even smaller precentage has the interest in so doing. (A much large proportion of the population shows the ability and the interest to tackle Excel’s native programming method: functional no-side-effects programming using Excel formulas, which accounts for the appeal of spreadsheets to the general populace!).

So although I’ve abstracted away a large level of the complexity of automating Excel by means of the SQLScript and SQL functions, it’s still programming.  But even if the full power of the tool is beyond most, the tabular sequential nature of  SQLScript and the relative approachability of SQL should mean that many can at least follow the logic and data flows and might even be able to modify existing scripts. Those users with a good knowledge of VBA and or SQL should find the tool enables them to be more productive and cost effective. While those with Excel formula skills can usually work in tandem with a framework of microETL functionality with relative ease.

For the latest versions and articles on using microETL follow the microETL tag on this blog …


One response to “microETL – the ultimate datasmithing tool?

  1. Pingback: Ruby and SQLite. A micro ETL environment. | Gobán Saor