Data Wrangler

A few weeks ago I came across (thanks to @lismissData Wrangler; a very promising data cleansing tool from the Stanford Visualization Group. Not only is Data Wrangler a web-service (which the group intend to open source) but it also allows transformations to be “recorded” in either Python or JavaScript (see here). It was this Python scripting feature that really caught my attention; would be very useful to be able to hack away at a dataset using the service, then transfer the script to microETL’s PyScript to adjust and integrate with Excel and SQL.

The demo video and test datasets give a good overview of the tool but the proof of the pudding is in trying out some real world dirty data; I chose a fine example of the art of Freedom Of Information datasets, issued by a Republic of Ireland government department. As an example of how not to do something (unless your intention is to make the recipient regret asking for the FOI in the first place) this is excellent. (I suppose we should be grateful it’s in Excel not Word or PDF or even PowerPoint). You can download it here (the data as released is in the FOI sheet).

As I said, Data Wrangler is promising, but needs some more work (to be fair, the group warns it’s a work in progress). The tool choked on the FOI dataset, too many columns I think, so not ready for the real world yet but I’ll be keeping an eye on its progress. Don’t let my experience put you off, it looks more than capable of handling smaller but still quite messy datasets.

If you’ve downloaded my example workbook, you’ll see how I managed to cleanse the data using microETL’s Python & SQL scripting functionality (the PyScript is in the Python sheet, with the SQLScript in the Control sheet). I could have cleansed the data using pure Excel and some VBA  and perhaps I would have if this was a format requiring parsing on a regular basis; I could then save the transformation as a single file macro-enabled workbook, ideal for sharing, no need for add-ins etc. But it was a once-off, and even if it wasn’t, it’s quite likely the format supplied in answer to a subsequent FOI request would be different. This is the sort of work that microETL’s Python & SQL scripting is designed for; quick and dirty data wrangling, but with the ability ro persist, and modify the resulting transformations if so required.

If you wish to try out this example, there’s a new version of microETL (Alpha1.08) available for download. You’ll notice a new folder structure (the usual sub-folders are now under a single sub-folder call microETL) to make installation of the add-in somewhat neater; and there’s also a setup.xls that’ll do all the hard work of installing (and un-installing) the microETL add-in. Note: you still need to manually install Python 2.7 to enable the PyScript’ing functionality.

If you need help with your Excel, ETL or  data cleansing tasks, I can help.


3 responses to “Data Wrangler

  1. Thanks Goban, worth looking into.

  2. Pingback: The Datasmith’s Hammer – 1st cut … | Gobán Saor

  3. Pingback: A frown, or two – Data Explorer to PowerPivot gotchas. | Gobán Saor