… well, at least for me. Let me explain.
For most of my datasmithing career I've had access to corporate Oracle databases, and now, with the availability of Oracle 10g Express, I can even run my own Oracle instances at home or on EC2. The combination of a powerful SQL engine, an expressive scripting language (PL/SQL), OS independence, a web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation. But there was always one major problem: Oracle doesn't play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend. Having been seduced by these shiny new (and open source) "toys", I've found that rather than just being incidental add-ons they have the potential to totally replace Oracle. The combination of Talend, SQLite and Groovy is proving to be particularly magic.
So how will these three tools enable you to leave behind your Oracle past?
Talend (in its Java form) is a superb ETL tool: via JDBC it can access every database type on the planet, it has built-in web-service capability, and its Java component gives access to a multitude of APIs for non-database data sources. Adding Groovy makes the use of such Java APIs simpler and quicker, and the same Groovy acts as a replacement for PL/SQL when a bit of "if-then-else" logic is required. And although Talend offers a built-in option to publish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful yet simple Jetty API to embed a web server within Talend itself. All this for free, and better than free, open source.
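In Talend you would do this from the Java/Groovy component with the Jetty API itself; to show just the shape of the idea (an ETL job that embeds its own web server and answers requests), here is a minimal sketch using Python's standard library as a stand-in. The handler class and response text are mine, purely illustrative.

```python
# Sketch: a job that embeds its own tiny web server, the way a Talend
# job can embed Jetty via its Java/Groovy component. Python stdlib is
# used here only as a stand-in for the Jetty API.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class EtlRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ETL job is up and serving data"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Port 0 asks the OS for any free port; serve on a background thread.
server = HTTPServer(("localhost", 0), EtlRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://localhost:{server.server_port}/") as resp:
    reply = resp.read().decode()
print(reply)  # -> ETL job is up and serving data
server.shutdown()
```

The point is simply that the "server" is owned and started by the ETL job itself, so the job can do its data-loading work first and then sit waiting for requests.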
So where does SQLite come in? And didn't you say that Excel integration was important? How will Excel communicate with Talend?
As very little corporate data is held in SQLite format, and Talend allows access to every major commercial and free database, the usefulness of SQLite might not be obvious at first. But think of SQLite as a data cache: a fast and efficient local tabular datastore, with a powerful and well-understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically creating and copying files), and maybe then you can see its attraction. The ability to extend that DSL by easily creating SQLite user-defined functions (UDFs) within Talend, using either Java or Groovy, is another powerful feature. For example:
select customer_id, name, sales_region,
       getpalodata('SALES', customer_id, 'All Products', 'Total Sales', 'Euros', 'YTD') as customer_YTD,
       getpalodata('SALES', sales_region, 'All Products', 'Total Sales', 'Euros', 'YTD') as region_total_YTD
from list_of_top_customers;
… where getpalodata is a UDF that wraps calls to a Palo cube.
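In Talend the UDF would be registered from Java or Groovy (sqlite-jdbc exposes this via its org.sqlite.Function class); the mechanics are easiest to show with Python's stdlib sqlite3 module, which has the same register-a-callable-then-use-it-in-SQL pattern. The getpalodata below is a dummy with a simplified signature, standing in for the real Palo call.

```python
import sqlite3

# Dummy stand-in for the real UDF: in Talend you'd register a Java/Groovy
# function that actually queries a Palo cube; this one fakes the lookup.
def getpalodata(cube, element, measure, currency, period):
    fake_cube = {("SALES", "C001", "Total Sales", "Euros", "YTD"): 125000.0}
    return fake_cube.get((cube, element, measure, currency, period), 0.0)

conn = sqlite3.connect(":memory:")
# Register the callable as a SQL function: name, argument count, callable.
conn.create_function("getpalodata", 5, getpalodata)

conn.execute("create table list_of_top_customers (customer_id text, name text)")
conn.execute("insert into list_of_top_customers values ('C001', 'Acme')")

row = conn.execute("""
    select customer_id, name,
           getpalodata('SALES', customer_id, 'Total Sales', 'Euros', 'YTD') as customer_YTD
    from list_of_top_customers
""").fetchone()
print(row)  # -> ('C001', 'Acme', 125000.0)
```

Once registered, the function is just another SQL expression, which is what makes the "extend the DSL" claim above concrete.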
With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something Oracle also supports, but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot, as SQLite lacks star-query (or even multi-index query) support. SQLite would still host the conformed dimensions and the fact tables, with the fact tables acting as feeds to Palo cubes and supporting finer-grained drill-throughs from cubes or ad-hoc queries. This is powerful stuff: simple, free, powerful stuff.
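To make the division of labour concrete, here is a small sketch (Python's stdlib sqlite3 again, table and column names invented for illustration): SQLite holds a conformed dimension plus a fact table, an aggregate query produces the feed a cube-load step would consume, and a drill-through comes back as plain SQL against the detail rows.

```python
import sqlite3

# SQLite hosts a conformed dimension and a fact table; the fact table
# feeds a Palo cube, and drill-through from the cube is plain SQL here.
db = sqlite3.connect(":memory:")
db.executescript("""
    create table dim_customer (customer_id text primary key, name text, sales_region text);
    create table fact_sales (customer_id text, product text, month text, amount real);
    insert into dim_customer values ('C001', 'Acme', 'EMEA');
    insert into fact_sales values ('C001', 'Widgets', 'JAN', 1000.0),
                                  ('C001', 'Widgets', 'FEB', 2500.0);
""")

# The aggregate = the feed a cube-load step would read ...
feed = db.execute("""
    select d.sales_region, f.product, sum(f.amount)
    from fact_sales f join dim_customer d using (customer_id)
    group by d.sales_region, f.product
""").fetchall()
print(feed)  # -> [('EMEA', 'Widgets', 3500.0)]

# ... while a drill-through from the cube hits the detail rows directly.
detail = db.execute(
    "select month, amount from fact_sales where customer_id = 'C001' order by month"
).fetchall()
print(detail)  # -> [('FEB', 2500.0), ('JAN', 1000.0)]
```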
… and the spreadsheet access?
A Talend sub-job such as this…
[Screenshot: Talend sub-job — Groovy component embedding a Jetty web server]
[Screenshot: example of Groovy code calling the Jetty API]
…would provide a simple RESTful (rather than SOAP) web service, which could be accessed either with an Excel Web Query or via a VBA macro that parses the result and allows for more control. For example …
http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable
… this would return a list of customers wrapped in an HTML table, or …
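A sqlgateway endpoint of that sort boils down to three steps: parse the sql parameter out of the URL, run it against the local SQLite database, and render the rows as an HTML table. Sketched below with Python's stdlib (in Talend this would be Groovy inside the Jetty handler); the function name and table layout are my own.

```python
import sqlite3
from urllib.parse import urlparse, parse_qs
from html import escape

def sql_to_html_table(db, url):
    # Pull ?sql=... out of the request URL and run it against SQLite.
    query = parse_qs(urlparse(url).query)["sql"][0]
    cur = db.execute(query)
    # Column names come from the cursor description; escape everything
    # so the result is safe to drop into an HTML page or Excel Web Query.
    header = "".join(f"<th>{escape(c[0])}</th>" for c in cur.description)
    rows = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in r) + "</tr>"
        for r in cur.fetchall()
    )
    return f"<table><tr>{header}</tr>{rows}</table>"

db = sqlite3.connect(":memory:")
db.execute("create table all_customers (customer_id text, name text)")
db.execute("insert into all_customers values ('C001', 'Acme')")

html = sql_to_html_table(
    db,
    "http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable",
)
print(html)  # -> <table><tr><th>customer_id</th><th>name</th></tr><tr><td>C001</td><td>Acme</td></tr></table>
```

Excel's Web Query feature can consume an HTML table directly, which is why HTMLTable is a convenient default output type.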
… a similar URL might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA's system.
What would the Talend job look like?
The job might operate something like this:
- It would run either on the client as a service or on a LAN-based server (or on a remote server, with an SSH VPN (or Hamachi) to provide security).
- At start-up, it would do a bunch of ETL tasks: pulling data from remote sources and databases, transforming and aggregating data etc., and storing the resulting data in local SQLite databases. It might also build Palo cubes or update larger enterprise databases.
- The job would then set up a Jetty web server and await requests for data.
- The requests might be a mixture of raw SQL and requests to run specific Talend transformations, which would either return a dataset directly to the calling client or simply acknowledge the request, queue it up for later processing, and send the resulting dataset by email or RSS feed when finished.
- At a fixed time the service would shut itself down and requeue itself for the next day's workload.
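The daily lifecycle above (ETL at start-up, serve requests until a cut-off, then shut down and requeue) can be sketched as a simple loop. Everything here is illustrative, not Talend's API; a fake clock stands in for real time so the sketch runs instantly.

```python
import datetime

def run_etl():
    # Placeholder: pull remote data, transform, load into SQLite / Palo.
    return "etl done"

def serve_until(cutoff, now_fn, handle_fn):
    # Serve requests until the cut-off time, then fall through to shutdown.
    served = []
    while now_fn() < cutoff:
        served.append(handle_fn())
    return served

# Simulated clock: two requests arrive during the day, then it's past 18:00.
clock = iter([datetime.time(9, 0), datetime.time(12, 0), datetime.time(18, 1)])

log = [run_etl()]
log += serve_until(datetime.time(18, 0), lambda: next(clock), lambda: "request handled")
log.append("shutdown, requeued for tomorrow")
print(log)  # -> ['etl done', 'request handled', 'request handled', 'shutdown, requeued for tomorrow']
```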
… or nothing at all like that, and that's the point: build what you need, add the level of security (or none at all) that fits your situation, all within an open framework with zero lock-in (okay, still using Excel; anyone for OpenOffice, Google Apps or Zoho?). You don't even need your own server: host it on an EC2 instance (if you bring up an instance for 10-12 hours every working day, it would cost about $20-$25 a month).
Now tell me that doesn’t make sense?