Category Archives: SQLite

HAMMER and Threads Redux – The adventure continues…

Sometimes, like Scott of the Antarctic, your data shaping activities may need to “go outside” of Excel and it “may be some time” before you return. That’s primarily why the HAMMER.exe command-line version exists; data can be packed up via the DELEGATE or SUBMIT commands; picked up by external HAMMER processes (I often use JSDB’s Grid micro-messaging JavaScript toolkit to orchestrate such workflows, but perhaps now I’ll use this – Amazon’s latest gem – AWS SWF!); then returned safely (unlike Scott) to Excel for further processing.

But, when “going outside” is not desirable or feasible, another solution is required, namely threads. There are those who regard threads as evil and I too would have to admit I generally would not use them without good reason. But sometimes they’re needed, usually to keep a controlling UI (in this case Excel) responsive and capable of monitoring and managing background long-running tasks.

I’ve already added threading capability to HAMMER via the hammerThreadEnabled function and via “internal HAMMER threads” (using NET4.0’s latest parallel programming additions).

Both of these “lock” the Excel main thread until complete, with hammerThreadEnabled also depending on Excel’s dependency graph to determine whether threading will happen or not. So I figured I needed to bite the bullet and add a new set of threading commands to better handle long-running tasks, be they tasks that eventually return data or a status to the main Excel thread (such as a long-running save to, or fetch from, a high-latency web service endpoint), or those that run continuously (such as Excel as a web service). The main new commands are:

  • PYTHONTHREAD – like PYTHON but spawns a thread to run the Python script. (.NET 4.0 runtime required)
  • JAVASCRIPTTHREAD – as above, but using JAVASCRIPT, with no need for .NET 4.0.

Both commands “take ownership” of the HAMMER session’s SQLite database and return the new thread’s ID (in the form PYTHREAD1, PYTHREAD2, JSTHREAD3 etc.), and are likely to be the last command in a HAMMER function call’s sequence.

The thread ID can then be used by the following helper commands to monitor, re-join and clean-up such threads:

  • SQLFROMTHREAD – like SQL but takes two arguments: the 1st is the thread ID, and the 2nd is the SQL to apply against that thread’s SQLite database, e.g. = … “JSTHREAD1”,”Select * from messages”,”SQLFROMTHREAD” … In general, only Select statements would be issued against this database; SQLite is thread-safe, so data updates are possible, but with obvious potential for SQLITE_BUSY locks and timeouts. (Likewise, the thread itself can communicate back to the main thread using an APPDB connection, as APPDB is always attached uniquely to the main Excel thread.)
  • THREADRETURN – will report the thread’s return value (an error or, on success, a string or table value) once the thread completes. Again, it uses the thread ID as its argument. If the thread is still running, the command returns “Waiting”.
  • DELETETHREAD – will attempt to terminate an already running thread. It will then, or if the thread has already terminated, clean up and close any resources associated with the thread – mainly its database, but also its Python and/or JavaScript engine instances.

It’s also possible to issue a HAMMER call that will re-join a completed thread’s database; this will happen if a HAMMER function call’s 1st argument is a valid thread name, e.g. =HAMMER(“JSTHREAD2”,”Select * from results”,”SQL”). If the thread is still running, the HAMMER function will exit with a return value of “Waiting”. With this, it’s possible to access not only the thread’s database but also its Python and/or JavaScript engine context. Multiple such HAMMER calls can be made until a DELETETHREAD command is issued or the Excel session is terminated.

As SQLite is thread-safe, SQL can be used as a simple and robust method of inter-thread communication. If the pattern of threads only updating their “own” database while the main Excel thread updates only “APPDB” is followed, few locking or timeout issues should arise (ha, famous last words – okay, not as famous as Scott’s). There’s also nothing to stop more adventurous inter-thread single-database usage if the need or impulse arises (“Do ya feel lucky, punk, well do ya!”). And don’t forget, normal disk-based databases can also be used alongside, or instead of, the more usual in-memory ones.
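As a sketch of this pattern outside HAMMER (Python’s sqlite3 standing in for the embedded C#-SQLite engine), a worker thread can write rows into a shared in-memory database which the main thread then reads back with plain SQL, much as SQLFROMTHREAD does:

```python
# Illustrative only - not HAMMER itself. A shared-cache in-memory SQLite
# database acts as the inter-thread message channel.
import sqlite3
import threading

URI = "file:threaddemo?mode=memory&cache=shared"

# Keep one connection open on the main thread so the shared
# in-memory database is not destroyed between uses.
main = sqlite3.connect(URI, uri=True)
main.execute("CREATE TABLE messages (body TEXT)")
main.commit()

def worker():
    # Each thread opens its own connection; the worker alone writes,
    # mirroring the threads-update-their-own-tables convention above.
    conn = sqlite3.connect(URI, uri=True)
    for i in range(3):
        conn.execute("INSERT INTO messages VALUES (?)", (f"msg {i}",))
    conn.commit()
    conn.close()

t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread reads the results back with plain SQL, much like
# ... "JSTHREAD1","Select * from messages","SQLFROMTHREAD" ...
rows = main.execute("SELECT body FROM messages ORDER BY body").fetchall()
print(rows)  # [('msg 0',), ('msg 1',), ('msg 2',)]
```

With writes confined to one thread at a time, no SQLITE_BUSY handling is needed; adventurous concurrent writers would have to add retry logic.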

This use of SQLite falls neatly into my view of SQLite not so much as a database but as a means of mapping and managing shared memory using SQL as an API.

It’s often said that SQLite should be viewed not as an alternative to Oracle/MySQL/MSAccess/whateverRDBMS but as an alternative to a file-open statement. But for me, I primarily use SQLite as an alternative to rolling my own in-memory shared data structures (be that across threads or application stacks).

Two other new commands added with this version are:

  • VERSION – will return the version of the internal HAMMER (XLite) engine in use.
  • LISTCOMMANDS – returns the list of commands available.

There’s also been a small breaking change to the JAVASCRIPT command. Scripts will now only return a table if the return value is a JavaScript Array of Arrays (List of Lists) object or a JSON string representing a LOL. All other objects will be converted to a JSON string representation or scalar string value.

A few extra JavaScript focused helper methods have also been added to the passed-in “db” object:

  • db.jsCreateTable(tableName, JSONaLOLorLOOobject) – where the 2nd argument can be a “JSON table” in either List of Lists or List of Objects format. A table will be created using either the first “line” of the LOL or the names of the objects’ fields in the LOO to provide the column names (no types applied, as they’re not needed most of the time).
  • db.jsInsertRows(tableName, JSONaLOLorLOOobject) – as above, but will insert/append the passed “rows” to the specified table.
  • db.jsonSQLQuery(SQLSelectStatement, typeOfJSON) – will return a table in List of Lists format if typeOfJSON = “LOL”, or as a List of Objects if “LOO”.
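To illustrate the two “JSON table” shapes these methods accept, here’s a Python sketch (the create_table helper is hypothetical, not part of HAMMER’s API) that builds a table from either a LOL or a LOO:

```python
# Demonstrates the two "JSON table" shapes described above:
# List of Lists (first row = column names) and List of Objects
# (field names = column names). create_table is illustrative only.
import json
import sqlite3

lol = [["dept", "headcount"],                 # LOL: header row first
       ["Sales", 12], ["IT", 5]]
loo = [{"dept": "Sales", "headcount": 12},    # LOO: one object per row
       {"dept": "IT", "headcount": 5}]

def create_table(db, name, rows):
    # Accept either shape, as db.jsCreateTable is described to do.
    if rows and isinstance(rows[0], dict):             # LOO
        cols, data = list(rows[0]), [tuple(r.values()) for r in rows]
    else:                                              # LOL
        cols, data = rows[0], [tuple(r) for r in rows[1:]]
    # No column types applied, as the post notes they're rarely needed.
    db.execute(f"CREATE TABLE {name} ({', '.join(cols)})")
    db.executemany(
        f"INSERT INTO {name} VALUES ({', '.join('?' * len(cols))})", data)

db = sqlite3.connect(":memory:")
create_table(db, "t1", lol)
create_table(db, "t2", json.loads(json.dumps(loo)))  # LOO via a JSON string
print(db.execute("SELECT * FROM t2").fetchall())     # [('Sales', 12), ('IT', 5)]
```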

There’s also a new JavaScript command: loadAssembly(dotNetAssemblyName), which will load a .NET assembly, allowing the use of any .NET library within JAVASCRIPT.

Both the JAVASCRIPT and threading commands are at a very early stage, only minimal testing so far, so use with care.

To download the latest version of HAMMER, use this link.

VBA Multithreading, .NET integration via HAMMER

In a previous post I urged all potential datasmiths to learn a scripting language (I suggested Python). But what of VBA, the granddaddy of the scripting world? Well yes, if you have a need to automate Excel then you must learn VBA. VBA is to Excel as JavaScript is to the modern browser; its tight integration with Excel’s Object Model and its superb debugging facilities make it the optimal choice for automating Excel.

VBA is now a one trick pony (Office automation), but Python opens all sorts of API scripting and product automation doors. My use of IronPython as HAMMER’s scripting language is one such door, a door to the otherwise mainly-closed-to-VBA world of multi-threading and easy .NET library integration. [UPDATE: Feb 2012 – HAMMER now also offers JavaScript as a scripting option]

The HAMMER function can be called from VBA using the Application.Run command like so…

retArray = Application.Run(“DATASMITH.HAMMER”,inArray,”Select dept,count(*) from table1″,”SQL”)

… the 1st parameter is the function name, parameter two is the 1st argument to the function, parameter three is the 2nd and so on.

By utilising HAMMER’s IronPython functionality (requires the .NET 4.0 runtime), VBA routines can access the full power of the .NET platform, with data passed back and forth using tables. Admittedly, for many complex .NET utilities, VB.NET or C# may be a better approach (due to the better IDE and debugging features of those languages), but for standard library calls IronPython is an ideal option. It also has the benefit that the “code” can be stored within the workbook.

HAMMER also offers the power of multi-threading to VBA via its internal threading functionality (requires Excel >= 2007 and the .NET 4.0 runtime). The multi-threading example in the hammerThreads.xlsx workbook could easily be wrapped in VBA code, perhaps to allow it to be controlled by a user-form.

I’ve added two new commands specially designed for use within VBA scripted HAMMER scenarios:

  • APPDB – opens an application-wide shared in-memory database. This will allow tables (and Python objects) created in one function call to be accessible in another function call (assuming both issue the APPDB command as their 1st command). This replicates the functionality of microETL, which by default exposes an application-wide SQLite in-memory instance and a common Python workspace.
  • CLOSEAPPDB – this will close and clear the shared C#-SQLite and IronPython instances. Equivalent of microETL’s xLiteReset() function.

Be careful if you …

  • use the APPDB instance from in-cell UDF calls to HAMMER that are likely to be scheduled as multi-threaded (the helper functions HammerToSheet & HammerToFit are safe, as they are always single-threaded),
  • or use within “internal threaded” HAMMER commands

as although C#-SQLite is thread-safe, the implementation logic is not.

UPDATE: Feb 2012

I’ve added new threading features to HAMMER, see here, and my implementation is now thread-safe (well, I think it is!); however, the warning still applies: think twice before updating a single database instance (such as APPDB) across multiple threads – read-only access should work fine.

Here’s a list of the HAMMER commands implemented so far …

You can download the latest version of HAMMER here …

Those with a datasmith’s hammer, see every problem as a table.

I picked the name HAMMER for my new micro ETL tool to capture the spirit of how it’s intended to be used. It’s an everyday tool, sufficient for many of the tasks a datasmith faces, and when it’s not, it will help with (or at least not get in the way of) the marshalling of data that needs to be processed by more powerful tools.

The saying “If you only have a hammer, you tend to see every problem as a nail” could be rephrased as “If you only have a datasmith’s HAMMER, you tend to see every problem as a table”! But that’s OK, as nearly every commercial IT problem can be expressed as a set of tables, and in any case, HAMMER is not intended to be a datasmith’s only tool. Its close integration with Excel recognises the prime importance of spreadsheets to most data wranglers, and its use of CSV & the SQLite database format as its persistence & transport mechanism (à la “SQLite as the MP3 of data“) recognises that datasets will often need to be shared with other tools.

Finding the perfect tool for the job is an IT obsession that most of our customers care little for; it’s only when our choice of tool affects their bottom line (i.e. excessive cost or wasted time) that end users take much notice of the backroom technology. The most important skill for a datasmith is the ability to understand the structures and forms that data can take and to be able to derive and build business value from that data. The technology is an aid, nothing more; technology (and indeed applications) will come and go, data lives on.

HAMMER encapsulates the minimum set of technical skills an aspiring datasmith should learn:

  • Spreadsheets, in particular Excel. No need to know every nook and cranny, nor expect it to handle every task. But do learn about pivot tables and array formulas; if you have Excel >= 2007, learn about Excel tables (aka Lists). If you have Excel 2010, make sure to download PowerPivot. Become comfortable with “formula programming”; don’t expect a ribbon command to solve every problem.
  • SQL – learn the basics of selecting, joining and group-by; again, no need to become a master. SQLite is not only an excellent way to learn SQL, it’s also a powerful tool to have once you’ve picked up the basics.
  • Learn a scripting language – Python is one of the best and is relatively easy to learn. Again, mastery is nice but not essential; learn the basics of IF-THEN-ELSE logic, loops and iteration, array and list handling, and string manipulation. Your efforts at coding do not have to be pretty or elegant, just functional. Python skills will also transfer across platforms: CPython (the original and the best), IronPython (.NET and HAMMER) and Jython (JVM – here’s a cool example of Python as a scripting language to automate GUIs).
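Those SQL basics – select, join, group-by – can all be tried out in minutes with SQLite, here via Python’s built-in sqlite3 module (the sample tables are made up for illustration):

```python
# A first taste of everyday datasmithing SQL: join two tables,
# then group and count.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE staff (name TEXT, dept_id INTEGER);
    CREATE TABLE dept  (id INTEGER, dept TEXT);
    INSERT INTO staff VALUES ('Ann', 1), ('Bob', 1), ('Cid', 2);
    INSERT INTO dept  VALUES (1, 'Sales'), (2, 'IT');
""")

# Select + join + group-by in one statement.
rows = db.execute("""
    SELECT d.dept, COUNT(*) AS headcount
    FROM staff s JOIN dept d ON s.dept_id = d.id
    GROUP BY d.dept
    ORDER BY d.dept
""").fetchall()
print(rows)  # [('IT', 1), ('Sales', 2)]
```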

All this talk of picking the right tools brings to mind the old golf pro-am story where the amateur was constantly pointing out to the pro (Jack Nicklaus I think) what club to play. At a particularly tricky shot to the green, the pro had enough when his choice of club was again “criticised”. So, he proceeded to take out every club including woods & putters, placed a ball for each one and hit every ball on to the green.

We’re not all as talented as Jack Nicklaus, so having at least a good-enough tool for the job at hand is important. But it does show that focusing on the end-game is what matters, not becoming fixated on a particular way of doing things.

Enough of the moralising, and talking of being fixated on a particular tool 😉 here’s this week’s list of new features to be found in HAMMER:

New commands:

TXT – will load a text file such that each line is a new row in a single-column table (column named “text”).

LOADDB – like OPENDB, opens the named database, but only if the database file exists (OPENDB will create a new empty database if the file is not found). Intended primarily as the end-point of a request-process-response loop; see DELEGATE/SUBMIT below.

SAVEDB – saves the current “main” database using the previous argument as its name (shorthand for … “main”,”filename.db”,”SAVENAMEDDB”).

SUBMIT – same as SAVEDB, but the value of the prior argument has a “.request” appended, making a file name in the format “databasename.request” to save to. Also, if the previous argument is “” or “{GUID}”, a globally unique name will be generated using a GUID. The main use-case for this command is to send the saved database for processing by an external program, maybe a Talend or Kettle ETL job, or a HAMMER.EXE job.

DELEGATE – same as SUBMIT, but doesn’t do any processing (i.e. all commands other than DELEGATE, which is expected to be the last command, are ignored); instead it’ll package the request in the saved database with the expectation that an external HAMMER.EXE, or another Excel-based HAMMER call, will do the processing.

Changes and helper functions to support DELEGATE/SUBMIT processing:

The DELEGATE and SUBMIT commands are how HAMMER implements its version of steam-powered servers.

The intention is that databases named with extensions of “.request” are picked up and processed by “servers” waiting for such files. (The transport of such files between server and the served is a separate issue; it might be simply a matter of placing files on a shared drive, or DropBox!) These servers may then populate that database with new data (or not – e.g. they might generate a CSV instead). When finished, a text file of the same name, but with the “.request” replaced by a “.response”, is written to indicate that processing is complete.

Both the Excel UDF HAMMER and HAMMER.EXE (the non-Excel command-line version of the UDF) have been changed such that when a single argument is passed, that argument is expected to be a DELEGATE-generated “.request” database. The database’s “pickled” arguments and data will then be unpacked and the “delegated request” processed.

HAMMER.exe, if started with no arguments, implements a simple DELEGATE “server”, i.e. it will wait for “.request” files in its default folder and process each file as above.
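That request/response loop can be sketched in a few lines of Python (illustrative only – the real HAMMER.exe also unpacks the pickled arguments inside each request database):

```python
# Minimal sketch of the "steam-powered server" loop: watch a folder
# for *.request files, process each, then write a *.response marker.
import glob
import os
import time

def process(request):
    # Stand-in for the real work: open the request database and run
    # its packaged commands.
    pass

def serve_once(folder):
    # One sweep of the folder: handle every pending *.request file.
    for request in glob.glob(os.path.join(folder, "*.request")):
        process(request)
        # Same name with ".request" swapped for ".response" - the empty
        # marker file that signals "processing complete".
        marker = request[: -len(".request")] + ".response"
        open(marker, "w").close()

def serve_forever(folder, poll_seconds=5):
    # hammer.exe started with no arguments behaves like this loop.
    while True:
        serve_once(folder)
        time.sleep(poll_seconds)
```

Note that the .request database itself is left in place: as described below, it is the marker file that signals completion, while the request database carries the data.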

Three RTD helper functions have been added to the HAMMER UDF:

waitOnHammerResponse(database (without the .response), [folder name]) – will return “Waiting” until it finds a database.response file, then return the original .request filename (i.e. not the response file, as it’s simply a marker; the database.request will contain the original data and any new data generated).

waitOnHammerRequest(database (without the .request), [folder name]) – will return “Waiting” until it finds a database.request file, then return the request filename.

waitOnFile(filename, [folder name]) – like the functions above, but without the response/request expectations.

Here’s a list of the HAMMER commands implemented so far …

Get the latest version of HAMMER here …

Using PowerPivot to Hammer home some facts

From my previous post’s example of a Hammer use-case, it’s obvious I primarily see Hammer (and indeed microETL) as a tool for shaping dimensional type data; i.e. relatively low volume, often very ‘dirty’, but very high (business) value.

Facts (aka transactional data) can of course be handled; particularly when already reduced, or when by nature low volume, such facts will in many cases fit easily in memory.

But when facts start to run into the millions of records, traditional in-memory manipulation becomes a problem. Obviously such large datasets should in the first instance be handled IT-side, utilising tools that are designed for such volumes, i.e. enterprise-class databases. But sometimes data, even large transactional datasets, is “thrown over the wall” with limited support offered (or accepted). Then again, there are plenty of cut-down versions of enterprise RDBMSs available (SQL Express, Oracle Express etc.), plus the FOSS offerings such as MySQL or PostgreSQL, that can be configured user-side to help tame these beasts.

If you’re using PowerPivot, you have another option, PowerPivot itself. With its ability to quickly load and compress large volumes of data and its ability to perform many data cleansing tasks by means of row-context DAX formulas, often that’s all that will be required.

The typical problems that a transactional dataset can throw up, such as data split over two tables (header and detail) or needing to replace a surrogate date key with an actual date (to enable certain DAX date functionality to work) can easily be fixed within PowerPivot.

One thing to note about fact data presented as a header-detail set is that traditional star-schema design requires such data to be flattened to the lower “grain” of the detail line, but PowerPivot doesn’t actually require you to do that. Some dimensions can link to the header (for example, Customer on Invoice Header) and others to the line (for example, Product on Invoice Line). The detail-line table is still the “hub” of the “star”, but one of its dimensions (the header table) is its route to multiple other dimensions. Not classic star-schema design, but it’ll work, and it’s good for quick-and-dirty analysis (there might be situations though where things don’t pan out as expected – see this – so perhaps best to stick with pure star schemas for complex work).

There’ll come a time, however, when you’ll be faced with the problem of manipulating large datasets outside of traditional RDBMS servers and outside of PowerPivot. Combining sets of data from multiple sources, as in my previous post, would be a prime example. Such projects often operate on a need-to-know basis, often with those supplying the data ‘outside the loop’. Today’s additions to HAMMER should help.

Three new commands – ATTACHDB, ATTACHDBINMEMORY and SAVENAMEDDB – will allow external disk-based databases to be attached to the default HAMMER database.

ATTACHDB requires a filename followed by an alias name for the attached db. Having an attached external database would allow, for example, a large fact table in CSV format to be loaded (and indexed) without touching memory. This could also be done using the previously introduced OPENDB command, but the benefit of ATTACHDB is that other non-memory-threatening processing can continue to take place in-memory.

ATTACHDBINMEMORY also attaches an external database, but this time loads it into memory, so any changes made will not be automatically persisted back to disk. To do that, use the SAVENAMEDDB command.

SAVENAMEDDB requires an attached database alias, followed by the file name to save the database to. It has other uses too, such as making backups or making copies of data for use in an external debugging platform (it can be much easier to debug Python using a proper IDE).
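Under the hood this is SQLite’s ATTACH DATABASE statement; here’s the idea shown with Python’s sqlite3 module (file and table names are illustrative):

```python
# ATTACH a disk-based database to an in-memory "main" database, so
# queries can reach the disk-resident fact table without loading it
# wholesale into RAM.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "facts.db")

# Build a disk-based database holding a "large" fact table.
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE sales (qty INTEGER)")
disk.executemany("INSERT INTO sales VALUES (?)", [(10,), (20,)])
disk.commit()
disk.close()

# Attach it to the in-memory main database under an alias; SQL can then
# mix in-memory tables with the attached disk-based ones.
main = sqlite3.connect(":memory:")
main.execute("ATTACH DATABASE ? AS facts", (path,))
result = main.execute("SELECT SUM(qty) FROM facts.sales").fetchall()
print(result)  # [(30,)]
```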

Alongside the facility to load data via CSV/TSV, I’ve also added an ADO command. This requires a valid ADO connection string, followed by either a table/view name or a SQL Select statement. It uses ADODB 2.7 to enable handling of modern Access file formats; I’ll eventually make an ADO.NET version to remove this dependency.

Finally, I’ve managed to get a DBAPI-compliant SQLite provider working in the PYTHON step. The provider is called XLite (it’s a modified version of Jeff Hardy’s SQLite3 library) and exposes most of the same functionality as CPython’s sqlite3 provider.

The library can open external SQLite databases, offering another means of accessing non-in-memory data, and accesses the HAMMER default databases via the pre-connected pyDB variable. Having the ability to lazy-load rows via a cursor loop is also very useful in reducing memory footprint when dealing with large tables (see the IrelandFOIExample_hammer_test2.xlsx for an example of using XLite).
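The cursor-loop idea in DBAPI terms looks like this (plain CPython sqlite3 standing in for XLite): iterating a query streams rows one at a time, so a large table never has to be materialised in memory at once.

```python
# Lazy row loading via a cursor loop, instead of fetchall().
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE big (n INTEGER)")
# A generator keeps even the load step memory-light.
db.executemany("INSERT INTO big VALUES (?)", ((i,) for i in range(100_000)))

total = 0
for (n,) in db.execute("SELECT n FROM big"):  # rows stream one at a time
    total += n
print(total)  # 4999950000
```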

Update:

I’ve removed the dependency on a specific ADODB library (I figured out how to do the equivalent of CreateObject in .NET for COM libraries). Also included is a first pass at a command-line version (hammer.exe). Examples:

hammer.exe mydata.csv CSV myotherdata.csv CSV JOIN > new.csv

hammer.exe inv.csv “CSV” “sum(qty)” REDUCE sum.csv TOCSV

Here’s a list of the HAMMER commands implemented so far …

Download the latest version of HAMMER from here …

HAMMER a new Excel ETL tool for the PowerPivot age …

So, why am I developing this new datasmithing tool when I already have microETL to do the same task?

To be honest, when I started, my aim was simply to port a lot of my useful framework code to .NET, as I’m doing more and more work in pure .NET these days.

It’s a lot easier to do so with a project driving the work rather than the drudgery of line by line conversions simply for the sake of conversion.

But over the last few months – investigating the process of moving some of my more useful code over to .NET; discovering C#-SQLite as a near drop-in replacement for SQLite; realising that IronPython 2.7 marks the coming-of-age of Python under .NET; discovering the ease of multi-threaded programming under .NET 4.0 – I began to see the potential of building a tool more suited to this new emerging world of technological plenty: of 64-bit, multi-core, high-RAM machines playing host to ground-breaking tools such as PowerPivot, rather than the 32-bit, single-threaded world that microETL was born into.

I wanted to bring microETL back to its roots as an in-cell function. To make it sit more easily with the spreadsheet’s functional style of programming – a style of programming that has been more successful at attracting “civilians” to programming than any other method. To make the tool more in tune with the Excel way of doing things.

At the same time, I also wanted the tool to be “detachable” from Excel, so that it could perform ETL functions server-side without the need for Excel. Ideally capable of being run from the command-line from a single no-install-required executable.

And so, the Datasmith’s HAMMER was born.

So where does this fit in with PowerPivot? When I design something like HAMMER I normally craft it around a series of “use cases”. Below is one of my PowerPivot & HAMMER use cases:

The Data Quality Team

The sales & marketing department of a wholesaling company have discovered PowerPivot and are excited about using it to help manage an upcoming merger with a recently purchased competitor, also in the wholesaling business. There’s a large overlap between customers and products supplied, and the sales structures, pricing and KPIs used by the two companies are very different. The transition phase from two separate companies to a single operating unit will need detailed planning and constant monitoring. Oh, and there’s also the problem of the brand new ERP system that’ll replace the existing systems currently in use.

The group can see how PowerPivot models will help with this task but are worried about sourcing and managing the data. They decide to appoint a small data quality team, with a mixture of IT and business experience, which will be responsible for mapping old to old, old to new, and managing the data deltas (business still goes on: new customers, new products etc.).

Most of this work revolves around “master data”. Transactional data may be much larger in volume, but get the master data right and the transactional data will be easy to handle, as long as the processing capacity to handle the volumes is available (which, thanks to PowerPivot and really fast high-RAM PCs, it is).

In golf, there’s a saying: “You drive for show, but you putt for dough”. Likewise in data: a lot of emphasis is put on the big-data transactional datastores, but the real analytical benefit comes from the less sexy master-data, aka the dimensions. As a result, the most important output from the data quality team will be a set of conformed dimensions.

Each week (and at month-end) the team will get extracts of the master and transactional data in various formats from both legacy systems, and will also get the WIP master datasets from the team working on the new ERP system. From these they’ll construct point-in-time conformed dimensions, combining old with old and old with new, and will rebuild transactional feeds with new conformed keys alongside existing keys. These datasets are then sent to the various analysis teams to enable them to build Excel and PowerPivot models that will hopefully all “sing from the same hymn sheet”.

And how will this be accomplished? With the datasmith’s HAMMER of course (well, it is my day-dream use-case!). Both in-Excel and command-line HAMMER “scripts” will be used to wrangle the datasets, and the complete sets will be packaged as SQLite database files.

When the analysis groups receive these files, they too will use HAMMER to extract the datasets they require, either straight to CSV for loading into PowerPivot, or into Excel for some further transformation work prior to modelling. To handle slowly-changing-dimension scenarios, many of the teams will squirrel away each week’s data in order to be able to model yet-unknown-but-very-likely changes of organisational structures.

Although this is a use-case, it’s based on reality – a reality that would have been so much easier to manage if I, and the teams I worked with, had had access to technology such as HAMMER and PowerPivot.

To download the latest version of the code, go to this page on my website.

Follow the HAMMER tag on this blog for information on commands and examples (best start with the oldest and work forward …)