Ruby and SQLite. A micro ETL environment.

In a previous post I mentioned that I use a combination of Ruby and SQLite as a micro-ETL environment. Many of you have asked what’s a micro-ETL environment (indeed what is ETL) and why Ruby and why SQLite?

ETL, as all you data-warehousing and business intelligence folks will know, is the Extracting, Transformation and Loading of data from source systems into a reporting/data warehousing system. The techniques of ETL are not unique to the DW/BI worlds but are used anywhere large and/or regular transfers of data are needed between one computer system and another, for example master data take-on for new systems or transactional interfaces between front and back-office systems.

There are many tools available from the various BI and database vendors to help automate the ETL process (including the free and open source Kettle from Pentaho) but I suspect that a large percentage of ETL processing is still hand crafted using native database tools such as Oracle’s PL/SQL and SQLoader and native OS tools such as Perl, bash shell and awk. The major reasons for continuing to hand craft are flexibility, simplicity, speed, familiarity and cost. But as ETL tools have matured, the major benefit of the non-tool approach remains flexibility, i.e. the ability to handle any type of data from any type of source. This is where the micro-ETL environment comes in; a light weight programming and data manipulation tool set to complement (or in some cases replace) the declarative and more formal full scale ETL tool.

MS Excel is the micro ETL tool of choice for millions and rightly so; the fact that most source systems will export their data in xls/csv format or expose their data model through a com interface makes it a natural choice for extracting data. The power of Excel and the readily available pool of Excel skilled technical and, even more importantly, non-technical users makes it a winner for light weight data transformations. And, on the load side, there is not a database or ETL tool that cannot read and write excel compatible data-files. You can even use Excel to automate the transfer of extracted files from one system to another using FTP, HTTP POST or as I’ve recently done, via Amazon’s S3 protocol.

But sometimes Excel is not the ideal tool; a non-Windows OS is the obvious case, SOAP or HTTP delivered XML/Json APIs is another, when a fully automated solution is required or when the transformation requires a great deal of old-fashioned programming (embedded data requiring awk/regexp type extraction or badly designed source data schema requiring SQL manipulation).

In fact, it was the exercise of getting Excel to talk to Amazon’s S3 that convinced me to consider using Ruby where once I would have used Excel/VBA. I managed to get both the S3 SOAP and REST interfaces working using VBA, but it wasn’t straight forward and took much longer than the equivalent in Perl, Javascript or Ruby. I’m sure I could also have achieved similar productivity working under .NET or Java. So why pick Ruby? I’ll explain in my next post …..

UPDATE:

I’ve changed my mind 😉 MicroETL is now my ultimate datasmithing tool …
http://blog.gobansaor.com/2011/04/29/microetl-the-ultimate-datasmithing-tool/

Advertisements