Excel Document-Oriented Database with Python Map & SQL Reduce

In a previous post Excel as a Document-Oriented noSQL database I discussed why a document-oriented approach to system design can be very useful due to the ease of setup and subsequent pain-free evolution of data models that comes with a schema-less datastore. Two new features I’ve added to microETL make using Excel as the basis for a document-orientated system much easier.

The two features are:

  • JSON support, load and manipulate JSON documents; very useful for communicating with webs services that increasingly are abandoning XML for JSON as their preferred inter-change format; but also very useful for serialising (in a simple textual format) all sorts of documents especially Excel data-loads.
  • SQL Virtual Tables, for example virtual table vtCSV; works directly with a CSV file, including writing back any changes to the file. Another set are vtVBA, vtScript and vtPY; allowing for bespoke VTs to be created via VBA, SQL Script or Python. It is these three VTs that are proving useful for crafting document-oriented systems.

Having the ability to use JSON as document serialisation format is the first step (prior to this I would have had to use XML or CSV) allowing me to store a document’s data without recourse to a schema. But how to view and analyse the documents? In document-oriented databases such as CouchDB, documents are likewise stored as JSON objects, while to fetch and analyse, these tools tend to use a variation on MapReduce; where a Map function gathers the subset of documents and whatever data items required, then either passes that back as a “view” or passes on the set to a Reduce function to do things such as SUM or AVERAGE etc.

MicroETL uses a variation on this MapReduce approach. Using the virtual table facility, Map functions are written in VBA, SQLScript or Python. Such Map functions have a standard signature, where by the 1st argument is the name of a SQLite table that the function must create and fill with the resulting “view”. That table will then be available to SQL to act as the Reduce method (sum(),avg() etc.) or simply as a conduit for passing back the table to Excel.

An Example

Say you use Excel to record your employees’ time-sheets. Each week you send each employee a partially filled out (StaffNo and WeekNo) time-sheet workbook. (You used to this manually but now a microETL-powered workbook does this automatically).

The employees fill out their sheets, and mail them back to your payroll clerk. In the past the clerk manually opened each mail, opened the attachment, transcribed the hours and overtime worked into another workbook, manipulated it and exported as a CVS to make it suitable for batch-load into the payroll system. Painful. But again, you’ve now automated this using microETL.

You’re thinking of replacing your existing, cheap and cheerful, payroll system with a new super-super one that handles direct remote time-sheet entry and sophisticated build-up-to-gross calculations. But here’s the rub, you’re current build-up-to-gross process is simple (standard hours one rate, overtime 30% extra), but change is afoot, you’re hoping to buy-out some of the current overtime spend and convert  it to standard hours, plus a whole raft of other changes. But before that, you need hard figures on what’s driving overtime and even then, you’ll have to introduce the necessary changes piecemeal, department by department. When you explained this to the Payroll System’s sales rep, he was very happy, happy to quote for the extra expense of such a plan. You’ve decided to stick with spreadsheets at least until you’ve decided on, and implemented, the new overtime regime.

Here’s the current timesheet:

The “Reason for Overtime” is free-format text, so the first thing to do is to codify the reasons and change the standard time-sheet template to use a drop-down menu. To help identify the reasons and to analyse the existing 3 years worth of time-sheets you hold, you use microETL to serialise the data into a document-oriented SQLite-based table. Using something like this:

The SQLScript is called for each workbook and produces a JSON object like so …

…which is stored in a SQLite table like this …

CREATE TABLE timedocs(id text,key text, date datetime,doc text);

…where doc holds the JSON object, id is a MD5 hash of the object, and key is the business key. To help quickly search the table a FTS-3/4 SQLite full-text-search index is created on the doc column (remember JSON is just text!). This will allow the free-formatted reason attributes to be quickly and easily analysed.

A number of vtPY “views” are then applied. One such, produces a table in the format required by the current payroll system – StaffNo,WeekNo,Standard Hours, Overtime Hours. You use this every week to load the weekly payroll run, but will use it for this exercise to validate you’ve fetched the full 3 years history by comparing against payroll’s history.

Lo and behold, they don’t agree, due to a lot of manually mis-typed hours resulting in over-payments (under-payments were always caught by the affected employee 🙂 ), but at least there’s none since you’ve automated the process.

Here’s the Python Map function:

An here’s SQL doing a Reduce to get totals by WeekNo.

SELECT WeekNo,sum(Standard),sum(Overtime) from table(vtPY,Python!PayrollList) group by WeekNo

The real beauty of this approach to Really Simple Systems, is that the core of the process remains under the control, and within the capability, of business users. Even if the writing of Map functions (in Python or VBA) is outside their comfort-zone (as would be the case for most) and even if “SQL Reduce” skills are scarce on the ground, the design and deployment of the source documents should be within the competence of Excel-savvy end-users. So at least data can be gathered immediately and handled manually while waiting for a power-user or IT resource to add the automation layer.

Such systems have resilience built-in. If, for example, you were using a cloud-hosted  web site to allow employees to download and upload time-sheets and that service fails, you could revert back to using email (or use Dropbox!).

Same for any automation elements that break over time, reverting back to “manual” is doable, whether or not to invest in re-automating being a simple cost-benefit analysis comparing on-going manual effort to automation costs.

Likewise, you could use an externally hosted website as your “gateway” but construct the upload and download links to use behind-the-firewall URLs, so taking advantage of the cost-benefits and simplicity of the cloud while keeping hard-data securely within your firewall.

These sort of benefits don’t just apply to data-collection systems but to reporting and data-provisioned systems such as micro-ETL and micro-BI initiatives.  And of course, the sweet-spot, is operational-BI, mixing BI insights with operational actions; the simplicity of Excel forms and reports, provisioned by the Excel’s new native BI engine, PowerPivot!

Need help utilising Excel to its full potential?  Contact Tom



4 responses to “Excel Document-Oriented Database with Python Map & SQL Reduce

  1. Pingback: Manipulating Excel tables with SQL | Gobán Saor

  2. Pingback: Excel as a document-oriented NoSQL database | Gobán Saor

  3. Vincent Baxter

    Very helpful article. The employee timesheet software I use negates a lot of the need for Excel, but it doesn’t hurt to have something to fall back onto for your own personal records.

  4. Pingback: Excel as a book of record. | Gobán Saor