MicroETL Alpha Release

As per my last post, here's an alpha release of the MicroETL "SQL & Sequencing" tool for Excel: http://bit.ly/boVxll

The zip file contains a 32-bit version of the MicroETL add-in. It includes Python support, but to activate it you'll need the latest version of Python, V2.7.

The add-in (MicroETL.xla) can be "installed" in the normal fashion, but make sure all the associated DLLs (Sqlite3.dll, xLiteSQLit3.dll) are in the same folder. For Python support you'll also need pydbVT.DLL and the initmicroETL.py file.

If you don’t wish to install the add-in, the example workbooks (TimeDimensionExample.xlma & TimeDimesnion97-2003Exmple.xls) contain Open_Workbook code that will load the add-in (providing they’re in the same folder).

The example is a classic ETL task: creating a time dimension. The 2010/2007 version takes advantage of the new Excel Table functionality, while the 97-2003 version uses plain old Excel ranges. The 97-2003 version also outputs the dimension as a CSV file, which will most likely end up in your My Documents folder.
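The add-in builds the dimension with SQL and Excel formulas, but the shape of the output is easy to picture. Here's a minimal stand-alone sketch of the same idea in Python (the column names and date range are illustrative, not necessarily the ones the example workbooks use):

    # Minimal time-dimension generator: one row per calendar day, with the usual
    # reporting attributes. Column names and the date range are illustrative only.
    from datetime import date, timedelta

    def build_time_dimension(start, end):
        cols = ["date_key", "iso_date", "year", "quarter", "month",
                "month_name", "day_of_week", "is_weekend"]
        rows, day = [], start
        while day <= end:
            rows.append([
                day.strftime("%Y%m%d"),        # surrogate key, e.g. 20100101
                day.isoformat(),
                day.year,
                (day.month - 1) // 3 + 1,
                day.month,
                day.strftime("%B"),
                day.strftime("%A"),
                day.weekday() >= 5,            # Saturday/Sunday
            ])
            day += timedelta(days=1)
        return cols, rows

    cols, rows = build_time_dimension(date(2010, 1, 1), date(2010, 12, 31))

    # Write it out as a CSV file, much as the 97-2003 example does
    with open("TimeDimension.csv", "w") as f:
        f.write(",".join(cols) + "\n")
        for r in rows:
            f.write(",".join(str(v) for v in r) + "\n")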

Here's the 64-bit version: http://bit.ly/8Y0owq (no Python). To use this version you must have a 64-bit OS AND a 64-bit install of Excel.

The Excel project is password protected; if you need access to the code, contact me and I'll send you the password.

Manipulating Excel tables with SQL

microETL SQL Window

Having upgraded xLite to be 64-bit compatible, and as a result having to re-test it, I decided to make some significant changes to how it works. Although xLite was, and is, SQLite embedded in Excel, access to SQLite databases is not the reason I've combined the two products. The manipulation of Excel tables using a SQL dialect, all the while operating in memory, is my goal.

The ability to access external disk-based SQLite databases can be useful (particularly when dealing with very large fact-like tables), but it's a perk, not the primary use case. The main aim is to bring the data-handling power of a SQL engine to bear on Excel tabular datasets; if Excel offered this by default, I would not be embedding SQLite. SQLite is a means to an end.

So what changes have I made?

  • First off, I've added a SQL helper form (right-click on a cell and pick SQL from the context menu). The form not only makes issuing SQL statements easier but also holds a history of commands if required.
  • Secondly, I’ve introduced a pre/post processor for the SQL() function.
  • I've also added two SQLScript (the new name for xLiteScript) helper forms: one to run a SQLScript from the right-click context menu, and a lookup form to help select the required microETL function, along with a description of the parameters it takes.

Examples of the new SQL() pre-processor in action:

=SQL("Select * from TABLE(Sheet2!A1)", "Sheet1!B17")

This will paste the table found at position Sheet2!A1 to Sheet1!B17.

=SQL("Insert into TABLE(Sheet1!B17) Select * from TABLE(Sheet2!A1)")

This does the same as above, the difference being that an existing Excel table at Sheet1!B17 with the same structure (i.e. the same headings) as Sheet2!A1 must already exist.

The TABLE(…) command tells SQL() to fetch the relevant Excel ranges and load the “tables” into SQLite; when the SQL command finishes, it writes the data (if changed) back to the originating Excel address and cleans up any temporary SQLite-side tables. Nearly all SQL commands can be issued against the resulting tables, including UPDATE, DELETE, DROP, CREATE TRIGGER, etc.
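Conceptually, the round trip looks something like the following rough Python/sqlite3 sketch (the add-in itself does this in VBA against the live workbook; the helper names, sheet names and sample data here are made up purely for illustration):

    # Rough sketch of what a TABLE(...) reference amounts to:
    # 1. read an Excel range (faked here as a list of rows, header first),
    # 2. load it into a temporary in-memory SQLite table,
    # 3. run the user's SQL against it,
    # 4. read the (possibly changed) rows back for writing to the range.
    import sqlite3

    def load_range_as_table(conn, name, rows):
        header, data = rows[0], rows[1:]
        cols = ", ".join('"%s"' % c for c in header)
        conn.execute('CREATE TEMP TABLE "%s" (%s)' % (name, cols))
        marks = ", ".join("?" for _ in header)
        conn.executemany('INSERT INTO "%s" VALUES (%s)' % (name, marks), data)
        return header

    def read_table_back(conn, name, header):
        return [tuple(header)] + conn.execute('SELECT * FROM "%s"' % name).fetchall()

    conn = sqlite3.connect(":memory:")

    # Pretend this range came from Sheet2!A1
    source = [("id", "region", "amount"), (1, "EMEA", 100), (2, "APAC", 250)]
    header = load_range_as_table(conn, "Sheet2_A1", source)

    # The user's SQL, run against the temporary table
    conn.execute('UPDATE "Sheet2_A1" SET amount = amount * 2 WHERE region = ?', ("EMEA",))

    # This is what would be written back to the originating Excel address
    print(read_table_back(conn, "Sheet2_A1", header))
    conn.close()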

Two other pre-processor commands are CELL(…) and CELLS(…). As with TABLE(…), data is loaded from the Excel addresses specified and placed in the SQL statement. CELL takes a single cell reference, while CELLS iterates through a multi-cell range and inserts a comma-separated list into the SQL statement (in a format suitable for IN (…) expressions).
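In other words, CELLS(…) boils down to a text substitution performed before the statement reaches SQLite; roughly along these lines (an illustrative sketch only, with an invented range and table name, not the add-in's actual code):

    # Illustrative only: turn a multi-cell range (faked here as a list of values)
    # into a comma-separated, quoted list suitable for an IN (...) clause.
    def cells_to_in_list(values):
        return ", ".join("'%s'" % str(v).replace("'", "''") for v in values)

    range_values = ["EMEA", "APAC", "LATAM"]   # pretend this is Sheet1!D1:D3
    sql = "SELECT * FROM sales WHERE region IN (%s)" % cells_to_in_list(range_values)
    print(sql)   # SELECT * FROM sales WHERE region IN ('EMEA', 'APAC', 'LATAM')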

SQLScript is a simple table-based command sequencer (ETL is primarily a step-by-step process) which can be used as an alternative to VBA when a simple set of events needs to happen in sequence. The optional SQL history generated by the SQL helper form is held in SQLScript format and written back to an Excel table on exit, allowing for the creation of repeatable scripts based on previous SQL commands.
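The SQLScript table layout itself isn't shown in this post, but the underlying idea is simply a table of steps executed top to bottom. A hypothetical sketch of that pattern (the step names and commands are invented for the example):

    # Hypothetical illustration of a table-driven sequencer: each row names a
    # step and the command to run, and rows execute in order.
    import sqlite3

    conn = sqlite3.connect(":memory:")

    script = [
        ("create staging", "CREATE TABLE staging (id INTEGER, amount REAL)"),
        ("load sample",    "INSERT INTO staging VALUES (1, 10.0), (2, 20.0)"),
        ("summarise",      "CREATE TABLE summary AS SELECT COUNT(*) AS n, SUM(amount) AS total FROM staging"),
    ]

    for step_name, command in script:
        print("running:", step_name)
        conn.execute(command)

    print(conn.execute("SELECT * FROM summary").fetchall())   # [(2, 30.0)]
    conn.close()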

I still need to fully test the new functionality and will do so over the next few weeks before I publish a beta version. If anybody wants to try the untested alpha version, let me know and I'll send you a copy; you can find it here.

Update:

See http://blog.gobansaor.com/2011/03/09/microetls-sql-function/ for a more detailed description of the =SQL() function.

Update:

The table() command now supports virtual tables, e.g. "select * from table(vtCSV,C:\data\myData.csv)"; see http://blog.gobansaor.com/2011/05/05/excel-document-oriented-database-with-python-map-sql-reduce/

The 64-bit question, and the birth of MicroETL

Like the original $64 question, I have for a long time been pondering the Excel 64-bit question: whether to “take it, or leave it”?

When it was first announced, I believed 64-bit Excel would only be of interest to a minority of demented quants in investment banks hacking their way to yet another evil model to bankrupt the world. The problem of incompatible add-ins, COM controls, etc. would also make its widespread adoption less likely.

But three things changed my mind:

  • first, the appearance of Windows 7-powered 64-bit PCs on many workers' desks
  • second, my embedding of SQLite into Excel which enables me to effectively use the extra memory available to a 64-bit instance to hold, analyse and transform a huge amount of raw relational data (in the form of a “:memory:” database)
  • and finally, the arrival on the scene of PowerPivot.

Although a PowerPivot workbook doesn't allow more than 4GB of memory to be addressed (this is a SharePoint restriction, as a 4GB workbook is likely to compress down to at or below the 2GB upload limit imposed by SharePoint), this represents a doubling of the 32-bit limit in memory terms, but much more in raw data terms, as PowerPivot can achieve up to 10x-20x data compression.
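To put rough numbers on that (my own back-of-the-envelope figures): 32-bit Excel tops out at around 2GB of addressable memory, so a 4GB 64-bit PowerPivot workbook already doubles the headroom; and at 10x-20x compression, that 4GB in-memory model could represent somewhere in the region of 40GB-80GB of raw source data.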

So, I decided to bite the bullet; fired up an AWS 64-bit image and started to upgrade my xLite code-base to handle this new world.

I decided on the TDM-GCC MinGW-w64 distro to re-compile 64-bit versions of my SQLite wrapper libraries, and set about converting my VBA code to handle both 64-bit and 32-bit DLLs. So far, easy peasy.

The problems started when I went to convert my JavaScript & Python wrappers. I couldn't get JSDB to compile to either 32-bit or 64-bit using MinGW-w64, so I abandoned the attempt, as my main use of JSDB is as a “command-line data-crunching tool” similar to, and used alongside, CSVFix and SFK (Swiss File Knife). The ability to call JavaScript in-process was relatively new and I could live without it.

More worrying was the problem with Python embedding; in-process Python would be harder to live without. The APSW SQLite library which I had used is 32-bit only. Luckily, the Python 2.7 version of the sqlite3 standard module (aka pysqlite) now allows the loading of SQLite C extensions. This enables me to load databases under Python and extract the C handle using a specially written extension, thus enabling Excel/VBA to attach to an in-memory SQLite database opened by Python.
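In outline, the Python side of that trick relies on sqlite3's extension-loading support. The extension file name and the handle-returning SQL function below are hypothetical stand-ins for my specially written extension, not part of the standard module:

    # Python 2.7+ sqlite3 can load SQLite C extensions at runtime.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.enable_load_extension(True)          # off by default for safety
    conn.load_extension("handle_xt.dll")      # hypothetical C extension

    # A SQL function registered by the extension that returns the raw sqlite3*
    # handle; the value is then passed to the VBA side so Excel can attach to
    # the same in-memory database.
    handle = conn.execute("SELECT get_db_handle()").fetchone()[0]
    print(handle)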

Without this new loadable-extension trick I would have had to modify the SQLite3 library itself, as the previous 2.6 version would not divulge SQLite's handle, while APSW did (which was the reason I picked APSW in the first place). As a result, xLite's Python functionality now requires Python 2.7.

So the end result is a somewhat reduced-functionality xLite (no in-process JavaScript, and missing APSW functionality such as Python-based SQLite virtual tables), but one running on the ultimate data-smithing platform:

  • 64-bit Excel, combined with
  • the relational power of 64-bit SQLite,
  • the fast and easy scripting power of Python 64-bit,
  • and topped off with the king of pivots, 64-bit PowerPivot.

A week well spent, I think!

Oh, I've also decided to rename the add-in to MicroETL. This is mainly to reflect the fact that xLite (SQLite embedded in Excel) is just one element within the tool. The pure-VBA detached-ADO recordset functionality, embedded Python and xLiteScript elements can exist independently of xLite. That, and I'm fed up with getting enquiries about X-Lite (http://www.counterpath.com/x-lite.html), and I own the http://www.microETL.com URL 😉

Micro ETL in the PowerPivot age

Although PowerPivot has many of the characteristics of an ETL tool, i.e. the ability to connect to disparate data sources, to filter that data and to transform it, it will still hit a brick wall when confronted with the typical data spewed out by operational systems. I'm sure this is by design, as a sophisticated ETL tool is both complex to design and, probably even more relevant, difficult to use.

Mind you, a few years back we IT pros would have said the same about front-end BI cube configuration, and behold, today we have tools such as PowerPivot that prove this doesn't always hold true. Perhaps subsequent versions of PowerPivot will do the same for ETL as it has done for BI cubes. In the meantime, much of the necessary ETL will have to take place prior to loading into PowerPivot. But where?

First off, what’s ETL?

The term ETL refers to one of the trinity of activities that have, over the last two decades or so, been at the heart of reporting/decision support systems. The other two terms, DW (data warehousing) and BI (business intelligence), are sometimes used to refer to the whole process but can also refer to two distinct sub-processes. Confused? Well, so you should be; these terms have been abused and redefined by scores of vendors over the years, but for our purposes here we'll stick to their roles as acronyms for two of the processes involved in the preparation and presentation of reporting data.

BI is the term now most commonly associated by non-IT folks with decision support systems, as its role is the most obvious, i.e. the front-end presentation and manipulation of data: the dashboards, pivots, charts, summary lists, etc.

DW, data warehousing, is the term that most IT people who've been in the business for a while would use to describe the techniques, best practices, etc. associated with this area. The heart of traditional DW was the data warehouse itself, a mighty repository of historical data optimized for reporting purposes. When DW as a concept started, it was very rare indeed for operational (OLTP) systems to hold transactional data for more than a few weeks, usually just long enough to get through month-end. Such specially built datastores' days may be numbered as the data capacity of operational systems grows and the data-munching ability of new ETL techniques (MapReduce, for example) to transform vast amounts of data continues to increase.

ETL stands for Extract, Transform and Load; sometimes also styled ELT, extract, load & transform (PowerPivot would fall into this category). This is the process which traditionally swallowed most of the development budget in DW/BI projects (and kept me gainfully employed for years). It was the area where the dark arts of datasmiths collided with the often frightening reality of raw, untamed data, with the added venom of corporate-politics-driven “data ownership” battles. A messy business, and it continues to be, even in these days of open data and open APIs.

ETL tools vary from text-editor-written SQL to hugely expensive point'n'click ETL packages. Packaged ETL vendors promised (and continue to do so) that their tools would vanquish the dark arts of datasmiths with products so easy to use that the CEO would chip in with a few scripts to get the project finished. The reality was that IT types found they had to learn yet another sub-optimal “language” and more often than not had to drop down to “proper” languages to actually drive the thing to completion. ETL was (and still largely is) the preserve of IT.

The tools have improved a lot since those early days, and open source has, for some at least, removed the six-figure licensing costs from the equation. ETL, like programming in general, is hard, so get over it. Tools, a basic knowledge of SQL and data modelling skills can help make ETL approachable for non-IT types, but it still has the potential to make your head hurt.

So what’s a PowerPivot’r to do?

If your organisation already has a data warehouse in place, you're in luck, as it's quite likely a lot of the data you require will exist in the optimal PowerPivot import format, i.e. a star schema. You might be out of luck though: a significant percentage of DWs will not have used dimensional modeling and you could find yourself looking at a complex, OLTP-like data model. In that case, and in the case of pulling the data directly from an operational system, you're in the micro ETL business. Even if your IT infrastructure provides you with cleansed and understandable data, you'll be faced with integrating external or shadow-IT data (probably one of the main reasons why PowerPivot appeals); again, you'll either need IT support or else you must learn how to do it yourself.

Long before the likes of PowerPivot appeared, I regularly found myself in need of a micro ETL toolbox, i.e. a set of tools that would enable me to quickly and cost-effectively prepare data for loading into some system or other. Nine times out of ten that system was an Excel PivotTable (the rest of the time it was usually a master data take-on task or some variation of systems commissioning). Although the consumers of my datasmithing services would most likely assume that I used Excel alone to perform these works of wonder, I usually had an Oracle database (along with its data loaders and superb PL/SQL language) as my secret ingredient. This combination of Excel and Oracle served me (and my clients) well, but it wasn't the Oracle bit that gave me the edge; I could, and did, substitute SQL Server and MS Access for the SQL layer. The real trick was the combination of Excel's flexibility and presentational strengths with SQL's list-handling power.

The problem with this approach was that the interface between the SQL engines and the spreadsheets often involved quite a number of manual steps, and the presence of database software (even MS Access) could not always be depended on. It took my discovery of SQLite to enable me to finally combine the two worlds; microETL was born!

This combination of Excel and an in-process SQL engine (provided by SQLite), with the added optional ability to call either VBA or Python scripts, has provided me with a hugely flexible and powerful micro ETL tool. Now, with the arrival of PowerPivot, I have both the micro ETL and micro BI tools to build cost-effective Excel-based decision support systems.

Being Excel-based means that the end result is delivered in a format that many business people are comfortable with; PowerPivot is designed very much with “civilian” datasmiths in mind. Likewise, microETL's ability to use VBA, simple SQL and Excel formulas to perform data transformations makes a large part (if not all) of the ETL process “civilian friendly”.

I'm not saying that everything I can do with microETL will be as easy for a non-IT datasmith; many data sources are either too difficult and/or too time-consuming for end-users to navigate; but much of the business logic can be expressed in Excel terms, with the highly technical or time-consuming tasks handled by SQL or VBA/Python/JavaScript. MicroETL is not only for once-off transformations but can also be used to automate ETL, report generation and refresh tasks (including refreshing PowerPivot itself).

So, if you’re thinking about utilising PowerPivot, but need help in preparing your data and automating the tasks involved, perhaps we should talk.

UPDATE:

I'm in the process of developing an off-shoot of microETL called HAMMER, which I believe will be even more suitable as a micro ETL tool for the PowerPivot age. You can follow its progress here…

JavaScript as an Excel scripting language via JSDB

A few years back I posted about JavaScript as an Excel scripting language via ExcelDNA. That involved using JavaScript (in the guise of JScript.NET) as an ExcelDNA scripting language. It was purely an academic exercise to prove it could be done; I continued to use C# (or, increasingly, VB.NET) to build .NET user-defined functions. This time, however, I've managed to embed JavaScript (in the guise of the Mozilla Foundation's SpiderMonkey) directly via a native C interface, not to prove I could do it (even though there's a definite satisfaction in simply doing it) but to use it.

Why add another scripting language to xLite? Hasn't it already got Python?

True, Python is and remains a very powerful add-on to xLite. It's a mature and long-established language, popular amongst IT professionals and “citizen programmers” alike. But it's a bit of a monster and can be awkward to package, particularly on Windows. By using Py2Exe, and after a lot of digging into the issue of manifest files, I have managed to package and isolate xLite's Pythonic bits so that it can be used on a PC without first installing the required Python version (I've only tested against V2.6; Python's lack of a side-by-side Windows installation capability is a major pain in the butt. Bad news: I tested against Python 2.5 and it doesn't work; good news: side-by-side is possible, simply change the system path to reflect whichever version you wish to run at the command line, though xLite will (must) continue to use V2.6). This “version hell” militates against using Python as a core element of xLite; fine for those of us who are comfortable with, and require, the full power of Python, but not as the tool's primary scripting environment.

No, what I need is:

  • a light (preferably a single EXE or DLL) and approachable, popular language,
  • with native SQLite support,
  • runnable as a standalone executable (on both Windows & Linux),
  • embeddable (is that a word?) in Excel via a VBA-friendly DLL.

Add to that essential list some nice-to-haves such as:

  • native COM-interface support (for the likes of ADO etc.),
  • native networking support, for HTTP, raw TCP sockets etc.,
  • native (and easy to use) XML and JSON parsers and emitters,
  • the ability to spawn detached/attached command-line processes and to stream data to and from such processes, allowing me to easily orchestrate and provide a “grid” of processes (scripts, command-line executables, Excel instances), both local and remote (with remote being either traditional servers, HTTP servers or Hadoop Streaming grids); see the sketch after this list.
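As flagged in the last point above, here's a sketch of that pattern in Python rather than JSDB, purely for illustration: spawn a command-line process and stream data through it (the ubiquitous sort command stands in for CSVFix, SFK, another Excel instance or a remote worker):

    # Spawn a child process and stream rows through its stdin/stdout.
    import subprocess

    rows = ["banana,3", "apple,10", "cherry,7"]

    proc = subprocess.Popen(
        ["sort"],                        # any command-line tool that reads stdin
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,         # text mode
    )
    out, _ = proc.communicate("\n".join(rows) + "\n")
    print(out)                           # apple,10 / banana,3 / cherry,7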

While looking at the various implementations of JavaScript as a server/shell language, I came across JSDB – JavaScript for Databases. It's a C++ wrapper around Mozilla's SpiderMonkey, with lots of useful data-related utility classes added. To make JSDB a perfect fit, it simply required:

  • a few minor changes to the SQLite class (allowing the loading of virtual table extensions, and adding the ability to pass in the address of already-open SQLite memory structures);
  • a linker change to use the DLL version of SQLite;
  • plus a VBA-friendly DLL wrapper (and VBA Declares to call the DLL) to replace the JSDB shell when embedding in Excel.

From my SQLite as the MP3 of Data post: “Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.” I can now add a distributed “fractional horsepower” processing engine for that distributed data. This transforms xLite from a micro-ETL platform into one capable of handling (or at least orchestrating) practically any ETL (Extract, Transform & Load), DI (data integration) or “Time Asset” (see this post) process.

UPDATE:

Here’s a VBA source with examples of using the embedded JSDB engine:

http://www2.gobansaor.com/share/example_jsdb_vb_embed.bas

… and the source code for the DLL wrapper:

http://www2.gobansaor.com/share/wrap_sqlite.cpp

… plus the VBA Declares and helper Windows APIs:

http://www2.gobansaor.com/share/jsdb.bas

UPDATE:

See JSDB being used to front-end Excel as a web service http://blog.gobansaor.com/2011/02/06/excel-as-a-diy-web-service/