Category Archives: Talend

Excel in Data

My speciality is data, more specifically, data at the edges. Data that has been extracted from, or needs to be generated outside of, formal systems. Such data tends to be extremely valuable and useful, but only if shaped into a format that has meaning to those who can use and analyse it. I’ve been extracting, marshalling and shaping such data for most of my 30 years in the IT business; I’m a datasmith.

The technical term for what I do is ETL (Extract, Transform & Load). It’s the name given to the skill-set and tools associated with the movement of datasets. It’s the unglamorous but essential backbone of a whole range of IT deliverables: Data Warehousing, BI (Business Intelligence), Planning & Forecasting, New System Commissioning, Systems Integration.

In the past my skills would only have been of use to large IT departments, as major “data shaping” activities such as BI & data warehousing were “owned” by IT; small businesses and departmental workgroups were outside the loop except as providers of requirements (and money) and as long-suffering “end users”.

This alienation from the process of data shaping and the need to simply “get things done” led to the evolution of “shadow IT”, usually involving Excel in some form or another. Although many attempts have been made to control and tame this wild west of data creation and consumption, they’ve largely failed, and with good reason: the best people to get value out of data are those who depend on and generate the data in the first place.

Microsoft has long recognised this and has over the years improved Excel’s ability to handle and pivot lists (aka tables, the primary data artefacts of business). Excel’s PivotTable is undoubtedly the world’s primary BI tool.

Microsoft has, with Excel 2010, introduced an even more powerful end-user pivoting tool, the free PowerPivot add-in. The world of self-service data analysis has changed for ever; not only can PowerPivot handle enormous amounts of data but it enables end-users and data analysts alike to build highly complex analytical models at a fraction of their previous cost.

PowerPivot offers the greatest returns to those organisations or individuals who have:

  • Access to report-friendly data sources, such as those provided by existing data warehouses.
  • The resources to purchase and maintain an Enterprise SharePoint farm (backed by Enterprise SQL Server), allowing PowerPivot dashboards to be shared (and controlled) via a web interface. Such infrastructure, although ideal, comes with a hefty price tag.

So, if your organisation or your team is missing one or both of the above, is PowerPivot not for you?

Far from it, PowerPivot is still extremely useful without either. Data that’s not in the ideal format can still provide useful analyses, and both Excel and PowerPivot can help cleanse such data. Likewise, just like an ordinary workbook, a PowerPivot workbook can be emailed or placed on a shared drive (all the supporting data is embedded in the .xlsx file). The only problem is, the time and resources associated with this activity can be substantial if you don’t have the ‘micro-ETL’ and general Excel VBA skills to help automate it. This is where I can help.

I’ve both the skills and the tools to build a rock-solid Excel-based ETL back-end for your PowerPivot projects. This can be as simple as a “good enough” data-shaping automation for a single PowerPivot cube or a sophisticated ‘micro-Data Warehouse’ which would act as a foundation for a longer-term self-service BI strategy. The only software product you’ll need to purchase is Excel; I’ll provide the rest. Most of the time Excel 2010’s & PowerPivot’s in-built ETL capabilities, automated with some VBA, will be all that is required, but when more ‘firepower’ is needed:

  • I’ve assembled my own Excel-based ETL tool specifically to enable me to quickly and cheaply build such micro-ETL solutions; it’s a collection of best-of-breed open-source tools (such as SQLite) driven by, and integrated into, Excel (versions 2000–2010, 32-bit & 64-bit). A sketch of the idea follows this list.
  • I can call on Palo OLAP (an open source, Excel-focused budgeting & forecasting tool);
  • and on Talend, a leading open source ETL and data-management suite.
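
To make the ‘micro-ETL’ idea a little more concrete, here’s a minimal sketch of the pattern in Python rather than the Excel VBA I’d normally drive it with. The file names and column names are invented for illustration; the point is the shape of the work: stage the raw extract in SQLite, clean and reshape it with SQL, then hand a report-friendly table back to Excel or PowerPivot.

```python
# Micro-ETL sketch: stage a raw CSV extract in SQLite, clean and reshape it,
# then write out a report-friendly table for Excel / PowerPivot to import.
# File names and column names are invented purely for illustration.
import csv
import sqlite3

con = sqlite3.connect("staging.db")
cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS raw_sales")
cur.execute("DROP TABLE IF EXISTS clean_sales")
cur.execute("CREATE TABLE raw_sales (sale_date TEXT, customer TEXT, product TEXT, amount TEXT)")

# Extract: load the raw export exactly as it arrives.
with open("sales_export.csv", newline="", encoding="utf-8") as f:
    rows = [(r["Date"], r["Customer"], r["Product"], r["Amount"]) for r in csv.DictReader(f)]
cur.executemany("INSERT INTO raw_sales VALUES (?, ?, ?, ?)", rows)

# Transform: trim stray whitespace, drop obviously bad rows, and aggregate to
# the grain the pivot actually needs (one row per month/customer/product).
cur.execute("""
    CREATE TABLE clean_sales AS
    SELECT substr(sale_date, 1, 7)    AS sale_month,
           trim(customer)             AS customer,
           trim(product)              AS product,
           sum(CAST(amount AS REAL))  AS amount
    FROM   raw_sales
    WHERE  trim(customer) <> '' AND amount <> ''
    GROUP  BY 1, 2, 3
""")
con.commit()

# Load: hand the shaped table back as a CSV that Excel / PowerPivot can import.
with open("clean_sales.csv", "w", newline="", encoding="utf-8") as f:
    out = csv.writer(f)
    out.writerow(["sale_month", "customer", "product", "amount"])
    out.writerows(cur.execute("SELECT * FROM clean_sales"))
con.close()
```

In practice the same steps sit behind the Excel front-end (the tool described above is driven from within Excel), so the person refreshing the data never has to leave the workbook.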

I can also automate the workflows and non-ETL repetitive tasks associated with such projects. Tasks such as: generating questionnaires, marshalling and recording the resulting replies, and generating non-PowerPivot workbook reports/pivots (remember, unlike previous versions, Excel 2010 can exist side-by-side with older versions of Office, so you only need to provide it to those who need its full power).

As for building PowerPivot cubes, I’ve a long background in the design & delivery of ROLAP/MOLAP solutions. PowerPivot is essentially a 3rd way of producing OLAP cubes, heavily influenced by the dimensional modelling technique commonly known as the star schema, but with the presentational simplicity of Excel’s PivotTable and an Excel-like formula language (DAX) bolted on. PowerPivot models built with an understanding of the usefulness of the star-schema approach will allow for richer and simpler data analytics. I can provide that knowledge or help you master it.
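
For anyone unfamiliar with the star-schema shape, the toy model below (table and column names invented, built in SQLite only because it’s handy) shows the idea: a central fact table of transactions keyed to small dimension tables, with the final query doing the kind of sum-by-attribute slice that a PivotTable, or a simple DAX SUM measure over an equivalent PowerPivot model, would produce.

```python
# A toy star schema: one fact table keyed to two dimension tables.
# A PowerPivot model of the same shape would relate the tables the same way,
# with a DAX measure (e.g. a SUM over fact_sales.amount) doing the aggregation
# that the final SELECT does here.  All names are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, amount REAL);

    INSERT INTO dim_date    VALUES (20100101, 2010, 1), (20100201, 2010, 2);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Print');
    INSERT INTO fact_sales  VALUES (20100101, 1, 100.0), (20100101, 2, 25.0), (20100201, 1, 80.0);
""")

# "Total sales by category and month" -- the sort of slice a pivot serves up.
for category, month, total in con.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM   fact_sales f
    JOIN   dim_date d    ON d.date_key    = f.date_key
    JOIN   dim_product p ON p.product_key = f.product_key
    GROUP  BY p.category, d.month
"""):
    print(category, month, total)
con.close()
```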

My involvement can be either on a “deliverable” basis, i.e. I deliver the required solution, usually fixed-price for a fixed deliverable, or on a consultancy basis, where the “deliverable” is more an ongoing process and would often involve a degree of skills transfer.

I excel in data; with tools such as PowerPivot, so can you.


LiteBI, Heavy ETL

Although my major BI interest is in micro-BI (or is that workgroup-BI?), i.e. data, perhaps cleansed and packaged elsewhere, made available locally on a datasmith’s PC, most likely with an in-memory OLAP as the analysis tool, the possibilities of the “cloud” as a BI platform have not escaped me.

From a micro-BI perspective, the cloud’s ability to act as a backup/mirroring tool or as an ETL/marshalling tool (anybody for Hadoop and SQLite?) is attractive. I’ve yet to make up my mind on BI delivered as a cloud PaaS, but obviously many others believe it has a future.

My main worry with PaaS is not lock-in (which exists equally for in-house proprietary solutions) but the danger of a Coghead-like lock-out. My other doubts are more technical; believing, as I do, that in-memory offers significant advantages over traditional ROLAP (simplicity being the main one), I don’t think multi-tenant in-memory architectures are yet a runner. But last week I had a demo of a new Spanish BI PaaS service, LiteBI, which might just change my mind.

Javier Giménez Aznar and his team previously worked on delivering Pentaho-based data warehouses to large Spanish corporations and government agencies, so they have a deep understanding of Mondrian ROLAP and are using that knowledge to build the LiteBI service, but this time with SMBs as the target customers rather than corporates. Pricing starts at €145 per month and is based on the number of concurrent users, the number of analytical spaces and the data volumes, so it’s not for very small firms, more for the ‘M’ in SMB.

Impressions? The cube designer, dashboard builders and the general UI are all very good and would, I think, appeal to end-user datasmiths and, as such, will be a major up-front aid to selling this product. But it was LiteBI’s approach to the thorny issue of ETL and data loading that impressed me and also helped ease some of my Coghead-induced fears.

BI technology stacks consist of three elements:

  • The “fancy” front-end: graphs, animated dashboards and so on.
  • The pivot engine: ROLAP or MOLAP or both.
  • The ETL process.
  • (Many would say there’s an important 4th, the data warehouse, but not every BI effort requires one; that’s another issue.)

LiteBI continues to build yet more functionality into their UI, and this “fancy” front-end is essential as it’s their “shop window”.

Mondrian provides their pivot engine, and again they continue to work on optimisations such as column-based datastores to increase speed and automate responsiveness tuning (end-users are very unforgiving of slow pivots).

But it’s in the 3rd area, that of the ETL process, that you realise the LiteBI team has real-world BI experience. Data is loaded into LiteBI via an API, but with the ETL process itself happening on the customer side.

“Well, so what?” you may ask. The extraction of data obviously has to happen customer-side (though not in the case of data sourced from the likes of SalesForce.com). Yes, but it’s the transformations and data cleansing that add true value to the ETL process and subsequently determine the quality and usefulness (as opposed to the speed or the “prettiness” of delivery) of the solution.
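
By way of illustration only: LiteBI’s actual load API is delivered as a Kettle plugin (mentioned below), so what follows is not its real interface. It’s just a generic Python sketch of the customer-side pattern being described: query and cleanse the data locally, then push only the shaped rows up to the hosted service over HTTP. The endpoint URL, payload layout and token are all invented placeholders.

```python
# Generic sketch of customer-side ETL feeding a hosted BI service.
# NOTE: this is NOT LiteBI's real load API (which ships as a Kettle plugin);
# the endpoint, token and payload shape below are invented placeholders.
import json
import sqlite3
import urllib.request

# Extract + transform locally: only clean, aggregated rows leave the building.
con = sqlite3.connect("erp_extract.db")
rows = [
    {"month": month, "customer": customer.strip(), "revenue": round(revenue, 2)}
    for month, customer, revenue in con.execute(
        "SELECT invoice_month, customer_name, SUM(net_amount) "
        "FROM invoices GROUP BY invoice_month, customer_name"
    )
]
con.close()

# Load: POST the shaped dataset to the service's (hypothetical) load endpoint.
req = urllib.request.Request(
    "https://example.invalid/api/v1/load",            # placeholder URL
    data=json.dumps({"space": "sales", "rows": rows}).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <api-token>"},  # placeholder credential
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print("load accepted:", resp.status)
```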

Part of the process of adopting LiteBI is an ETL consultancy stage where a LiteBI partner company will provide on-site services to build this ETL layer, handling not just the transformations but also the initial load and the automation of subsequent delta uploads.

So the cost mounts up, but in reality you can’t do BI without this investment; there’s no ETL magic bullet. Even so, Javier says the typical go-live time for a LiteBI project would be in the order of 3-4 weeks rather than the 3-4 months of similar on-site Pentaho projects.

The end-user ‘owning’ the ETL process makes the prospect of a service lock-out slightly less worrying as, at least, one would still have a good starting point for moving to another provider or back in-house. What I would really like to see would be the option to self-host LiteBI, which I guess would involve open sourcing large parts of the service (the automated optimisation strategies could, for example, be excluded from this open source version).

The load API comes packaged as a plugin to Kettle (aka PDI), and the intention is to offer a similar add-on for Talend in the near future. LiteBI also has a white-label offering whereby 3rd-party OLTP solution providers can use the service as their product’s BI suite.

Like the Skibbereen Eagle keeping its eye on the Czar of Russia, I too will be keeping a watchful eye on LiteBI and the march of on-demand BI in general.

Why not join me on Twitter at gobansaor?

Talend ETL Excel report generator

Hugo, who you may remember from his OLAP Cube as a Mind Map project, has struck again. This time it’s something really useful: a component for the Talend ETL platform that generates Excel reports using templates and a JSP-style tag language to control the output.
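
To give a flavour of the template-plus-tags approach (this is not Hugo’s component, nor its tag syntax; just the general idea sketched in Python with the openpyxl library): the report layout and formatting live in a pre-built workbook, and the generator simply walks the sheets replacing placeholder tags with data.

```python
# Sketch of template-driven Excel reporting: placeholder tags in a
# pre-formatted workbook are replaced with data, leaving the layout untouched.
# Not Hugo's Talend component -- just the general idea, here via openpyxl.
from openpyxl import load_workbook

data = {"report_title": "Monthly Sales", "period": "2009-03", "total": 12345.67}

wb = load_workbook("report_template.xlsx")            # assumed template file
for ws in wb.worksheets:
    for row in ws.iter_rows():
        for cell in row:
            value = cell.value
            if isinstance(value, str) and value.startswith("${") and value.endswith("}"):
                cell.value = data.get(value[2:-1], value)   # "${total}" -> data["total"]
wb.save("monthly_sales_2009-03.xlsx")
```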

I’ve in the past used the excellent Xlsgen to automate the production of Excel reports, but Hugo’s component has the benefit of being free (Xlsgen now costs €390!) and open source, and it also taps into the vast world of existing Talend ETL components.

Well done Hugo.

Why not join me on Twitter at gobansaor?

Pentaho Data Integration (Kettle) V Talend Benchmark

Pentaho’s Matt Casters has just published a benchmarking exercise comparing Kettle and Talend. In it he admits he’s not a Talend expert, and he advises that people should perform their own benchmarks where possible as requirements differ. Nevertheless, unlike most other benchmarks we’ve seen on the subject, he publishes not just the results but the actual transformation “code” used in the tests.

For many people these benchmarks are of no real interest; as long as the product does what is required within the time and resources available, they’re content. But it would be a mistake to think that benchmarks don’t matter; they do, and people have made, and will make, that final decision based on them. Remember, ETL is not life and death; the decision on which tool (if any) to go with may not get the level of investigation that the developers behind such products expect of their potential clientele, and this is particularly true of open source. Busy people will use such reports to direct them down a path or to confirm their existing prejudices. So I’m really glad to see Matt responding, and in particular, responding in the manner he has.

Database vendors have for years played the benchmarking game, setting and breaking records either via real technological advances or by simply gaming the process. We as purchasers and users knew in many cases to take the results with a large dose of salt, but purchasing decisions were nevertheless made on the back of these surveys.

Why not join me on Twitter at gobansaor?

Open Source Metrics and Benchmarks

Marc Russel’s blog links to a Manapps ETL benchmark report comparing the performance of several leading ETL tools, both proprietary (DataStage and Informatica) and open source (Talend and PDI, aka Kettle). As would be expected, each tool has its own strengths and weaknesses, but one thing stands out: the venerable Kettle ETL, aka PDI 3.0, is now a serious contender for handling very large datasets. Oops, that’s what I get for wishing for a result and (mis-)reading the report early in the morning with a cold and a bad sore throat; sadly, PDI is still very much slower than its open source cousin Talend. In fact, Talend continues to play on the strength that comes from a code-generated solution, i.e. raw speed. As a pure ETL play, Talend is well capable of playing on the same pitch as the “big kids”.

Interestingly, the report is also “open source” as it’s released under a Creative Commons License, so I can link to it here.

UPDATE:

There’s now a new version of the report available (www.manapps.com, Topic Benchmark); it seems the original was just a work-in-progress and was not meant for public release. The main difference appears to be a significant improvement in Informatica’s ‘score’, but I’m not sure, as I was really only interested in comparing the two OSS products, Talend and Pentaho PDI; in that ‘battle’ Pentaho still comes out ‘slower’.

The original Marc Russel blog entry and a subsequent one reporting the updated report both appear to have been removed.

Also, I was informed of the ‘updated’ report via this email from Manapps, which assures vendors that they are happy to rerun any tests and provide any information re the running of such tests …

Dear Sir,

You referred on your web site to the report called “Benchmark ETL” by Manapps, from November 2008. This draft report was not intended to be publicly released since just a working document.
We would like you (i) publish Asap the modified version (or its related link) that supersedes the former one (on our web site (www.manapps.com, Topic Benchmark), (ii) state that Manapps had no intend to release the former report and accordingly takes no responsibility on its content, (iii) state that Manapps holds all necessary elements at the disposal of all vendors so that they can rerun some tests if wished that will then be published.

Regards,
Philippe THOMAS

Time: Thursday March 5, 2009 at 5:10 pm


Another analysis of OSS in the wild, this time from Chris Keene, WaveMaker CEO, on OSS as a marketing tool. Bottom line: 1% conversion rate, 700 paying customers in 9 months …

WaveMaker OSS as a marketing tool



Why not join me on Twitter at gobansaor?