Open Source Metrics and Benchmarks

Marc Russel’s blog links to a Manapps ETL benchmark report comparing the performance of several leading ETL tools, both proprietary (DataStage and Informatica) and OS (Talend and PDI, aka Kettle).  As would be expected, each tool has its own strengths and weaknesses, but one thing stands out: the venerable Kettle ETL, aka PDI 3.0, is now a serious contender for handling very large datasets.  Oops, that’s what I get for wishing for a result and (mis-)reading the report early in the morning with a cold and a bad sore throat; sadly, PDI is still very much slower than its OS cousin Talend.  In fact, Talend continues to play on the strength that comes from a code-generated solution, i.e. raw speed.  As a pure ETL play, Talend is well capable of playing on the same pitch as the “big kids”.

Interestingly, the report is also “open source” as it’s released under a Creative Commons License, so I can link to it here.

UPDATE:

There’s now a new version of the report available (www.manapps.com, Topic Benchmark); it seems the original was just a work-in-progress and was not meant for public release.  The main difference appears to be a significant improvement in Informatica’s ‘score’, but I can’t be sure, as I was really only interested in comparing the two OSS products, Talend and Pentaho PDI; in that ‘battle’ Pentaho still comes out ‘slower’.

The original Marc Russel blog entry, and a subsequent one reporting the updated report, both appear to have been removed.

Also, I was informed of the ‘updated’ report via this email from Manapps, which assures vendors that they are happy to rerun any tests and to provide any information regarding the running of such tests …

Dear Sir,

You referred on your web site to the report called “Benchmark ETL” by Manapps, from November 2008. This draft report was not intended to be publicly released, since it was just a working document.
We would like you to (i) publish ASAP the modified version (or its related link) that supersedes the former one (on our web site, www.manapps.com, Topic Benchmark), (ii) state that Manapps had no intention of releasing the former report and accordingly takes no responsibility for its content, and (iii) state that Manapps holds all necessary elements at the disposal of all vendors so that they can rerun some tests if they wish, which will then be published.

Regards,
Philippe THOMAS

Time: Thursday March 5, 2009 at 5:10 pm

 

Another analysis of OSS in the wild, this time from Chris Keene, WaveMaker CEO, on OSS as a marketing tool. Bottom line: a 1% conversion rate and 700 paying customers in 9 months …

WaveMaker OSS as a marketing tool

 

Why not join me on Twitter at gobansaor?


13 responses to “Open Source Metrics and Benchmarks”

  1. It’s an interesting observation you had there, Tom. There is no reason why copying (text-to-text) a small file would be slower than a large file in the case of PDI, since we’re using a streaming engine. As such, all parties involved would probably be interested in verifying these results. IMHO it’s a little bit too coincidental that an unknown “research institute” from Paris heavily supports a data integration company from Paris.

    This goes double if you consider things like “ease of use”, etc. That’s highly subjective at best.

    Enough said.

  2. @Matt,

    Yes, some of the results seem odd. Although I mainly use Talend now, I still have a soft spot for Kettle and think it is an excellent (and very easy to use) tool in a data warehousing environment (rather than the MicroETL/DI tasks I do), especially combined with all the other Pentaho “goodness”.

    In my own (very unscientific) tests Talend has always been faster than PDI for text file processing, but not to the degree being reported here. For transformations involving database-to-database work, the differences were usually insignificant. I would also think PDI’s ability to easily split processing over several parallel streams would help speed things up considerably (a rough sketch of the idea follows this comment).

    Tom
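
    For what it’s worth, here’s a rough Python sketch of the general “parallel streams” idea Tom mentions: fan a file’s rows out across worker processes, transform them, and collect the results. It only illustrates the concept (it is not how PDI implements it), and the file names and the per-row transform are invented:

        import multiprocessing

        def transform(line):
            # stand-in for a per-row transformation step
            return line.strip().upper()

        if __name__ == "__main__":
            # read all rows, fan them out over four workers, then collect
            with open("input.txt") as src:
                lines = src.readlines()
            with multiprocessing.Pool(processes=4) as pool:
                results = pool.map(transform, lines, chunksize=10000)
            with open("output.txt", "w") as dst:
                dst.write("\n".join(results))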

  3. Well, to be honest, 3.0.0 is now a year old; we’ve improved performance all over the place since. Then again, things like lazy conversion were added to compensate for the slower file processing. I doubt a lot of people process files these days, but if you do, it shouldn’t be any slower than Talend (on the contrary).

    From looking at the “report”, it appears to have been written by someone very experienced in Talend. Based on that experience they took the benchmarks that Talend does well at (text file handling, in-memory stuff) and then tried to do the same in PDI, DS and INFA. Well, that approach fails on numerous occasions. I’ve seen several of the PDI transformations where they took the slowest possible options to do the work. The whole ELT logic especially fails miserably. Reading from a database table and then doing an aggregation is really silly: in PDI you would just write a SELECT with a GROUP BY, right? Right? All of a sudden we forget how to write SQL? (See the sketch after this comment.)

    Finally, I think the 44-second penalty that Informatica gets on all operations is highly suspect as well. It might be older technology, but I **really** doubt it does as badly in real life as it does in some of these benchmarks.
    Anyway, I think the whole thing is kind of sad for Talend, that they had to resort to these kinds of practices. It’s not really the thing you do as an open source company, IMHO. Let me evaluate the numbers myself: give me the source files and the transformations used.
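
    To make Matt’s GROUP BY point concrete, here is a minimal sketch of the two patterns in Python, using the stdlib sqlite3 module; the database, table and column names (warehouse.db, sales, region, amount) are invented for illustration, not taken from the benchmark:

        import sqlite3

        conn = sqlite3.connect("warehouse.db")

        # the benchmarked pattern: stream every row out of the database,
        # then aggregate row by row inside the ETL tool
        totals = {}
        for region, amount in conn.execute("SELECT region, amount FROM sales"):
            totals[region] = totals.get(region, 0) + amount

        # the push-down pattern: one statement, aggregation done by the database
        totals = dict(conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"))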

  4. Matt,

    You’re right as regards SQL; I can’t understand this aversion to using it. In fact, even when I’m mainly dealing with text or Excel data I load the data into a SQLite database (in memory, preferably) and let it do the heavy lifting; it’s quicker and less error prone than using tool-based transformations (a rough sketch follows this comment).

    Anyway, this speed thing is a red herring; people will use the tool (and programming language) that they “want” to use (even to the point of paying big bucks when there’s an OSS solution that could equally do the job!).

    If there are speed problems these days it’s often easier to throw hardware (or another EC2 instance) at the problem.

    Tom
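
    A rough sketch of the approach Tom describes, using Python’s stdlib csv and sqlite3 modules; the file, table and column names (orders.csv, orders, customer, amount) are invented for illustration:

        import csv
        import sqlite3

        # load the text data into an in-memory SQLite database ...
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

        with open("orders.csv", newline="") as f:
            rows = ((r["customer"], float(r["amount"]))
                    for r in csv.DictReader(f))
            conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

        # ... and let SQL do the heavy lifting rather than tool-based steps
        for customer, total in conn.execute(
                "SELECT customer, SUM(amount) FROM orders GROUP BY customer"):
            print(customer, total)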

  5. This is a nice find. Thanks Tom.

    I have not used Kettle or DataStage at all, but I am very familiar with Informatica. I could see that the developer seemed a bit unsure with INFA, as in a number of places he/she could have used a ROUTER rather than a FILTER and saved some time. Also, it is good practice to use a SORTER before an AGGREGATOR, as the sorter transform is considerably faster than the aggregator in INFA. Another thing in Informatica’s defense is that this was done on an XP laptop/desktop; INFA is probably significantly faster when deployed on an AIX-type machine with large memory. And I also agree that the best way to aggregate is in the DB if possible. Even with INFA, they have taken the output to an aggregator (again without a sorter) rather than doing it in the Source Qualifier via a SQL query.

    As far as bias is concerned, they did show DataStage PX to be as fast as or faster than Talend, so this might be inexperience with the other tools rather than deliberate bias.

    I feel that one is going to use the tool that one is most comfortable with (for me, Informatica and Talend at this point) unless you just cannot achieve something at all with that tool.

  6. Thanks Tom for the link.

    We all know that:
    – a benchmark is a benchmark and remains subjective (you always know one tool better than another…).
    – benchmarks never please everyone.

    This being said, even if certain results need to be looked at more closely (especially Informatica’s results, in my opinion), this is a really good job from the Manapps team. It takes a lot of time and skill to produce this kind of paper, and I salute their initiative.

    Great job!

  7. “Talend experts write benchmark that says Talend is faster”.

    I don’t salute their initiative at all. It’s misinformation at best, slander at worst.

  8. Talend experts? You’re kidding!
    I think the guys who wrote this benchmark don’t know much more about Talend than you do, Matt!

    Some examples: they didn’t use v3.0 (milestone versions of v3.0 were available since July, and we made some huge enhancements to a lot of components: tFileInput, tAggregate, t…). In numerous jobs the basic approach is used; it would be much better (efficiency-wise) to leverage dedicated components (tJoin, tFilterColumns…), though I think these observations also apply to the other tools.
    No really, Talend experts don’t work like that.

    Fabrice | talend

  9. Pingback: Nicholas Goodman on Business Intelligence » Blog Archive » An arms race my customers don’t care about

  10. I appreciate the effort, but the referenced article isn’t convincing to me. If he truly wants to be scientific about it, he needs to provide the actual files used, the program settings, etc., so it can be replicated. Otherwise it’s a bit like cold fusion.

    The wording used in statements like the one below suggests a bias to me as well. This, and the lack of experimental detail, leaves me waiting for a study that validates the results here before I’d accept any of it as fact. There are too many missing variables.
    “Our main reason for this assessment of Pentaho is mostly linked to the many parameters that need to be learnt. However, we think that if you *invest lots of time in it*, it could become a powerful tool.”

  11. Dan – spoken like a guy who spent years working at a company that knows a thing or two about experiments? 🙂

  12. Nick, I read my comment again, and it made me laugh. 😮 OK, I was expecting a little too much, and it would be *really* good to have that kind of comparison around, but it has to be fair and thorough enough to be repeatable by others. This one just seems to be missing too much in terms of how it was run.

  13. Pingback: Pentaho Data Integration (Kettle) V Talend Benchmark « Gobán Saor