Talend + SQLite + Groovy the new Oracle …

… well, at least for me.  Let me explain.

For most of my datasmithing career, I’ve had access to corporate Oracle databases and now with the availability of  Oracle10g  Express I can even run my own Oracle instances at home or on EC2.  The combination of a powerful SQL engine, expressive scripting language (PL/SQL) ,OS independence, web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation.   But there was always one major problem, Oracle doesn’t play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend.  But having been seduced by these new shiny (and open source) “toys” I’ve found that rather than just been incidental add-ons they had the potential to totally replace Oracle.  The combination of Talend, SQLite and Groovy, is proving to be particularly magic.

So how will these three tools enable you to leave behind your Oracle past?

Talend (in its Java form) is a superb ETL tool, via JDBC is can access every database type on the planet, it has built-in web-service capability and access to a  multitude of APIs via its Java component for non-database data sources.  The addition of  Groovy makes the use of such Java APIs simpler and quicker and the same Groovy acts as a replacement for PL/SQL when a bit of “if-then-else” logic is required.  And although Talend offers a built-in option to plublish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful, yet simple, Jetty API to embed a web server within Talend itself.  And all this for free, and better than free, open source.

So where does SQLite come in? And, didn’t you say that Excel integration was important, how will Excel communicate with Talend?

As very little corporate data is held in SQLite format, and Talend allows access to every major commercial/free database, the usefulness of SQLite might not be at first obvious.  But if you think of SQLite as a data cache, a fast and efficient local tabular datastore, with a powerful but well understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically copying and creating files), maybe then you can see its attraction. The ability to extend the DSL by easily creating SQLite user defined functions (UDFs) within Talend using either Java or Groovy is also another powerful feature.

For example…

select customer_id, name,customer, sales_region, getpalodata(“SALES”,customer_id,”All Products”.”Total Sales”,”Euros”,”YTD”)  as customer_YTD, getpalodata(“SALES”,sales_region,”All Products”.”Total Sales”,”Euros”,”YTD”)  as region_total_YTD from list_of_top_customers;

… where getapalodata is a UDF that wraps calls to a Palo cube.

With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something that Oracle also supports but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot, ( as SQLite lacks star-query (or even multi-index query) support.)  SQLite would still host the conformed dimensions and the fact tables, but with the fact tables acting as feeds to Palo cubes, supporting finer-grained drill-throughs from cubes or for ad-hoc queries. This is powerful stuff, simple, free, powerful stuff.

… and the spreadsheet access?

A Talend sub-job such as this…

Talend Groovy Jetty web server

Talend Groovy Jetty web server

Example of Groovy code calling Jetty API

Example of Groovy code calling Jetty API

…would provide a simple RESTful (rather than SOAP) web service which could be accessed either with an Excel Web Query or via a VBA macro which would parse the result and allow for more control.  For example …

http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable

… this would return a list of customers wrapped in an HTML table, or …


…this might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA’s system.

What would the Talend job look like?

The job might operate something like this:

  • It would run either on the client as a service or on a LAN based server (or on a remote server, with a SSH VPN (or Hamachi) to provide security).
  • At start-up, do a bunch of ETL tasks, pulling data from remote sources and databases, transforming and aggregating data etc. Storing the resulting data in local SQLite databases.  It might also build Palo cubes or update larger enterprise databases.
  • The job would then setup a Jetty web server and await requests for data.
  • The requests might be a mixture of raw SQL or requests to run specific Talend transformations which would return a dataset directly to the calling client or maybe just acknowledge the request, queue it up for processing later, sending the resulting dataset by EMail or RSS feed when finished.
  • At a fixed time the service would shut it self down and requeue itself for the next day’s workload.

… or nothing at all like that, and that’s the point, build what you need, add the levels of security (or none at all) that fits your situation, all within a open framework, with zero lock-in (okay, still using Excel, anyone for OpenOffice, Google Apps or Zoho?).  You don’t even need your own server, host it on an EC2 instance, (if you bring up an instance for 10/12 hours every working day, it would cost about $20/$25 a month).

Now tell me that doesn’t make sense?


7 responses to “Talend + SQLite + Groovy the new Oracle …

  1. Hi Tom,
    Another excellent article. I look forward to your posts. I am going to clip it for future reference.

    Currently, I am using Talend for simpler tasks (read data, transform and output). I have not tried using web services, java, groovy etc. all mashed up together. But this is an excellent tutorial for it. Any chance you can share your sample jobs so I’ll have a starting point?

    Also in your estimation, how well will this scale if there are few million rows and complex calculations are involved?

    Everyday I work with Informatica which is a great tool. But the ease and flexibility of Talend+JDBC is the best value proposition I’ve seen in a log time. Though in our big company (my day job), this will be unthinkable.


  2. @Sean,

    I hope over the next few months do a series of posts, detailing how I would approach this, free time permitting.

    As for would it scale? If you happen to work for a major telco or other large utility or bank then I would say no, even though at a departmental level within such organisations, yes. As for millions of rows, if those rows were fact table rows, yes, if they were dimensional rows, no.

    With this approach if you did reach a performance bottleneck, you’ve a multitude of escalation options to choose from, some free (replace Groovy with Java, replace SQLite with PostgreSQL), some not (roll back in Oracle or SqlServer, replace Palo with Essbase), But, I think the real test whether an approach like this can scale would lie with whether or not the domain being modelled is amenable to being built out rather than up ,i.e. as you grow you add more servers and instances rather than a more powerful machine with more CPU/RAM plus a more powerful database/MOLAP server.


  3. I look forward to your articles that will flesh out this idea better for the readers.

    Also thanks for the tip on scaling. I was talking a long-shot project I might do. I’ll rather build out and keep the basic tool set same. I hope I’ve that problem 🙂


  4. Cool article, nice ideas. One thing I don’t understand is this statement:

    “Talend offers a built-in option to publish an ETL job as a WAR file exposing a SOAP web service”

    I have looked at Talend Open Studio but could not find any reference to this. Is this part of their commercial add-on service?


  5. Nevermind, silly question. Missed the export option.

  6. Pingback: SAX and Bugs and XBRooL « Gobán Saor

  7. Pingback: A Drop In The Stream › links for 2010-05-07