Groovy as Talend’s scripting language

Although I had decided to use Talend (Java version) as my primary ETL tool, I still had one major problem with it: its lack of a scripting tool. Kettle (Pentaho PDI) has JavaScript, Excel has VBA, Picalo has (well OK, is) Python and Talend in its Perl version has Perl. I could have gone with calling JavaScript, Jython or JRuby via JSR 223 (and did experiment with it), but I wasn’t happy with the level of integration this afforded, opting instead to make command-line calls to Python (using SQLite as a data carrier).

Then I discovered Groovy, or I should say rediscovered it, as I’d come across it many years ago when it was far less developed than it is now. I liked it then but couldn’t see a use for it at the time and promptly forgot about it. Then it appeared wrapped in a Talend component, prompting a quick visit to the Groovy website, which turned into a deep dive into the language; I’d found my scripting tool!

Groovy (by the way, what a terrible name for a language, or is that just me?) is not really a stand-alone language but more an extension to Java itself, offering the full power of Java with the addition of closures, builders and dynamic types. In fact, over time Groovy has become more and more Java-like (the biggest missing piece being support for anonymous inner classes).
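To give a flavour of those three additions, here is a minimal sketch that should run as-is in the Groovy console. The variable names and the little "job" document are made up for illustration; `MarkupBuilder` is the XML builder that ships with Groovy.

```groovy
import groovy.xml.MarkupBuilder

// Dynamic typing: a 'def' variable can hold values of different types over time.
def value = 42
assert value instanceof Integer
value = "forty-two"
assert value instanceof String

// Closure: a block of code stored in a variable and passed to another method.
def toUpper = { String s -> s.toUpperCase() }
def names = ["talend", "groovy"].collect(toUpper)
assert names == ["TALEND", "GROOVY"]

// Builder: nested method calls and closures describe a tree, here an XML document.
// The element names (job, step) are hypothetical.
def writer = new StringWriter()
def xml = new MarkupBuilder(writer)
xml.job(name: "demo") {
    step("extract")
    step("load")
}
println writer
```

The builder turns each unknown method call (`job`, `step`) into an element of the same name, which is why no schema or class definitions are needed up front.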

To underline this convergence, Groovy is being developed under the separate JSR 241 rather than JSR 223. There’s full interoperability between the two languages: Groovy compiles down to JVM bytecode and can use Java classes and objects, and Java can likewise use Groovy-generated bytecode. This allows for fast prototyping and development without compromising access to Java’s vast collection of libraries.
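As a rough illustration of that two-way street, the sketch below uses a standard Java class directly from Groovy and then hands a Groovy-written class to a Java library method. The word lists and the `ByLength` class are invented for the example, and it assumes a Groovy version recent enough to coerce closures to Java functional interfaces.

```groovy
import java.util.concurrent.ConcurrentHashMap

// Groovy using Java: a plain JDK class, instantiated and called as in Java.
def counts = new ConcurrentHashMap<String, Integer>()
["etl", "groovy", "etl"].each { word ->
    // The trailing closure is coerced to the java.util.function.BiFunction that merge() expects.
    counts.merge(word, 1) { a, b -> a + b }
}
assert counts["etl"] == 2
assert counts["groovy"] == 1

// Java using Groovy: a Groovy class implementing a Java interface
// (ByLength is a hypothetical name) ...
class ByLength implements Comparator<String> {
    int compare(String a, String b) { a.length() <=> b.length() }
}

// ... passed to a Java library method that knows nothing about Groovy.
def words = ["groovy", "etl", "java"]
Collections.sort(words, new ByLength())
assert words == ["etl", "java", "groovy"]
```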

Here, for example, is a piece of code to try out the JPalo library’s ability to access a Palo cube:

import org.palo.api.Connection;
import org.palo.api.ConnectionFactory;
import org.palo.api.Cube;
import org.palo.api.Database;
import org.palo.api.Element;

connection = ConnectionFactory.getInstance().newConnection("localhost", "7777", "admin", "admin")
database = connection.getDatabaseByName("Demo");
cube = database.getCubeByName("Sales");
rowElements = cube.getDimensionAt(0).getElementsInOrder();
columnElements = cube.getDimensionAt(1).getElementsInOrder();
dataSet = [rowElements, columnElements]
dataSet << cube.getDimensionAt(2).getElementAt(0)
dataSet << cube.getDimensionAt(3).getElementAt(0)
dataSet << cube.getDimensionAt(4).getElementAt(0)
dataSet << cube.getDimensionAt(5).getElementAt(0)

// fetch data set
datas = cube.getDataArray(dataSet as Element[][])
connection.disconnect();

// parse the return string
rowcount = rowElements.length;
columncount = columnElements.length;
data = []
heading = []

// first row set to the row names (i.e. "Product name" followed by the country names)
heading << "Product"
for (i in 0..columncount-1) {
    heading << columnElements[i].getName()
}
data << heading

// now output each line
for (i in 0..rowcount-1) {
    row = []
    row << rowElements[i].getName()
    for (j in 0..columncount-1) {
        row << datas[((i + (j*columncount)))]
    }
    data << row.flatten()
}

// output to csv file
def csvOut = new FileOutputStream('c:/data/File.csv')
for (lines in data) {
    lines.eachWithIndex { col, i ->
        if (i > 0) {
            csvOut << ","
        }
        csvOut << col
    }
    csvOut << "\n"
}
csvOut.close()

This was done in the Groovy console as a proof of concept; it was then transferred to a tGroovy component, where it was parametrised and, instead of outputting to a CSV file, used to fill the globalBuffer structure (the structure used by the tBufferOutput component).

Other things I managed to do with Talend tGroovy over a few days:

  • Extended SQLite with my own user-defined Palo functions.
  • Set up a Talend job as an Excel-accessible RESTful web service using Jetty.
  • Interfaced with Amazon S3.

Although I was very familiar with the S3 and JPalo APIs, both SQLite UDFs and Jetty were new to me, and that’s where scripting proves its worth, giving the developer the maximum support with the minimum of background noise. But it’s not just weird and wonderful new APIs that scripting helps expose; as a datasmithing tool, a language such as Groovy gives analysts the ability to quickly deconstruct and model datasets (for example, see Groovy’s SQL database support and its collections functionality).
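To make the collections point a little more concrete, here is a small sketch of the kind of quick slicing I have in mind. The sales rows are made-up sample data (not taken from the Palo demo cube), held as a plain list of maps; only standard Groovy collection methods are used.

```groovy
// Hypothetical sample data: one map per row.
def sales = [
    [country: "Ireland", product: "Desktop", units: 10],
    [country: "Ireland", product: "Laptop",  units: 5],
    [country: "France",  product: "Desktop", units: 7],
    [country: "France",  product: "Laptop",  units: 3]
]

// Group the rows by country, then total the units within each group.
def rowsByCountry = sales.groupBy { it.country }
def unitsByCountry = rowsByCountry.collectEntries { country, rows ->
    [(country): rows.sum { it.units }]
}
assert unitsByCountry == [Ireland: 15, France: 10]

// Filter the rows, then pull out a single column with the spread operator.
def bigDesktopCountries = sales.findAll { it.product == "Desktop" && it.units > 8 }*.country
assert bigDesktopCountries == ["Ireland"]

// Turn the summary into CSV-style lines, ready to be written out or buffered.
def csvLines = unitsByCountry.collect { country, total -> country + "," + total }
assert csvLines == ["Ireland,15", "France,10"]
```

Each step is a one-liner that can be tried and adjusted in the Groovy console before it goes anywhere near a job.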

As an infamous Irish farming-pharma TV ad of my youth put it, “It’s a queer name but great stuff”.

8 responses to “Groovy as Talend’s scripting language”

  1. jsr241 has been in the “expert group formation” stage for over four years. It’s a misconception that it’s being developed under a JSR at all.

  2. @Neal

    Thanks for pointing that out, it is misleading. I must admit I (like most folks I guess) wouldn’t know one end of a JSR process from the other!

    Tom

  3. Talend will probably start downplaying its Perl roots. I can even see them discontinuing the Perl version down the line.

    Advantages of using Talend/Perl:
    1. Could deploy Perl code as an exe on Windows machines without needing to upgrade JVM client (this was a restriction based on corporate policy).
    2. Have a ready scripting language.
    3. Plentiful modules are available for Perl in Talend and outside.

    Disadvantages of Talend/Perl
    1. Exe works only on Windows. For Linux, I am not sure I can create an executable that combines all the modules.
    2. Keeping together all the modules for DB connections etc. is a pain.
    3. No JDBC connection and a lot of DB connections will require one to have a client installed (like DB2). I think this is a big downside.

    I chose Perl because of JVM upgrade issues on my first project, as well as my familiarity with Perl. But the more I look, the more I feel that I need to learn Java and start using the Java version. If the JVM version is not an issue, I think Java probably has all the advantages and none of the disadvantages (now that one can use Groovy).

    I might have misstated some of the above. Anyway, what is your opinion on Talend Java vs. Talend Perl, as you seem to be comfortable with both versions?

  4. Sean,

    It was only when the Java version of Talend appeared that I started to look seriously at Talend; although Perl has its advantages, within most corporate environments, Java is an easier sell (at least to “civilians”).

    JVM issues are less of a problem now that multi-megabyte downloads/installs are commonplace: you can package your jars in an EXE, along with either the JVM you require or a facility to download the required version (using tools such as http://jsmooth.sourceforge.net/).

    The Java world is a rich source of corporate-focused APIs and JDBC is mature and robust, so even without Groovy I would still be sticking with the Java version. Groovy just makes it better and means practically everything I need can be provided by the one environment.

    If you are about to learn Java, then Groovy is an ideal way to start; in fact, for the sort of activities you’re likely to engage in with a tool like Talend, you may never need to write “real” Java.

    Tom

  5. Tom,

    What you are saying makes sense. Normal business people are more comfortable with Java than Perl (which most have not even heard of). Also I am realizing that one can have different versions of JVM w/o affecting each other.

    I got tired of dealing with Perl modules very fast. Also the API support, JDBC and everything points to using Talend/Java rather than Talend/Perl.

    I’ve looked into Groovy. As you suggest due to similarity with Java, it might be a good place to start.

    Thanks again for a great blog post.

  6. “Groovy (by the way what a terrible name for a language, or is that just me?), ”

    It’s just you. “Groovy” is a great name for a language.

  7. Pingback: 16.4. Comparativa ETL Talend vs Pentaho Data Integration (Kettle). « El Rincon del BI

  8. Pingback: Comparing Talend Open Studio and Pentaho Data Integration (Kettle). « El Rincon del BI