From my previous post’s example of a Hammer use-case, it’s obvious I primarily see Hammer (and indeed microETL) as a tool for shaping dimensional-type data; i.e. relatively low volume, often very ‘dirty’, but of very high business value.
Fact data (aka transactional data) can of course be handled too, particularly when already reduced or by nature low volume; such facts will in many cases fit easily in memory.
But when facts start to run into the millions of records, traditional in-memory manipulation becomes a problem. Obviously such large datasets should in the first instance be handled IT-side, using tools designed for such volumes, i.e. enterprise-class databases. But sometimes data, even large transactional datasets, are “thrown over the wall” with limited support offered (or accepted). Even then, there are plenty of cut-down versions of enterprise RDBMSs available (SQL Server Express, Oracle Express etc.), plus FOSS offerings such as MySQL or PostgreSQL, that can be configured user-side to help tame these beasts.
If you’re using PowerPivot, you have another option: PowerPivot itself. With its ability to quickly load and compress large volumes of data, and to perform many data-cleansing tasks by means of row-context DAX formulas, often that’s all that will be required.
The typical problems that a transactional dataset can throw up, such as data split over two tables (header and detail), or needing to replace a surrogate date key with an actual date (to enable certain DAX date functionality to work), can easily be fixed within PowerPivot.
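To illustrate the surrogate-date-key fix (in PowerPivot this would be a row-context DAX calculated column), here's a minimal Python sketch of the same transformation, assuming a hypothetical yyyymmdd-style integer key:

```python
from datetime import datetime, date

def datekey_to_date(key: int) -> date:
    """Convert a yyyymmdd surrogate key (e.g. 20110315) to a real date."""
    return datetime.strptime(str(key), "%Y%m%d").date()
```

Once every fact row carries a genuine date rather than an integer key, DAX time-intelligence functions can operate on it directly.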
One thing to note about fact data presented as a header-detail set: traditional star-schema design requires that such data be flattened to the lower “grain” of the detail line, but PowerPivot doesn’t actually require you to do that. Some dimensions can link to the header (e.g. Customer on Invoice Header) and others to the line (e.g. Product on Invoice Line). The detail line table is still the “hub” of the “star”, but one of its dimensions (the header table) is its route to multiple other dimensions. Not classic star-schema design, but it’ll work, and it’s good for quick-and-dirty analysis (there might be situations where things don’t pan out as expected, see this; for complex work it’s perhaps best to stick with pure star schemas).
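For the classic star-schema route, flattening a header-detail pair down to the detail-line grain is just a join. A minimal sketch using CPython's sqlite3 module, with hypothetical invoice tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE invoice_header (inv_no INTEGER, customer TEXT);
    CREATE TABLE invoice_line   (inv_no INTEGER, product TEXT, qty INTEGER);
    INSERT INTO invoice_header VALUES (1, 'ACME');
    INSERT INTO invoice_line   VALUES (1, 'widget', 3), (1, 'gadget', 2);
""")

# Push header attributes (Customer) down to the line grain, so every
# dimension hangs directly off a single fact table
flat = con.execute("""
    SELECT h.customer, l.product, l.qty
    FROM invoice_line l
    JOIN invoice_header h ON h.inv_no = l.inv_no
""").fetchall()
```

Each detail line now repeats its header's attributes, which is exactly the redundancy a pure star schema trades for simpler, more predictable analysis.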
There’ll come a time, however, when you’ll be faced with the problem of manipulating large datasets outside of traditional RDBMS servers and outside of PowerPivot. Combining sets of data from multiple sources, as in my previous post, would be a prime example. Such projects often operate on a need-to-know basis, with those supplying the data ‘outside the loop’. Today’s additions to HAMMER should help.
Three new commands, ATTACHDB, ATTACHDBINMEMORY and SAVENAMEDB, will allow external disk-based databases to be attached to the default HAMMER database.
ATTACHDB requires a filename followed by an alias name for the attached db. Having an attached external database allows, for example, a large fact table in CSV format to be loaded (and indexed) without touching memory. This could also be done using the previously introduced OPENDB command, but the benefit of ATTACHDB is that other, non-memory-threatening processing can continue to take place in-memory.
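Since HAMMER's databases are SQLite-based, ATTACHDB appears to map to SQLite's ATTACH DATABASE statement. A rough sketch of the idea using CPython's sqlite3 module (the file name, alias and table are all hypothetical):

```python
import sqlite3

# In-memory "default" database, standing in for HAMMER's default db
con = sqlite3.connect(":memory:")

# Attach an on-disk database under an alias; "facts.db" and the
# alias "facts" are made-up names for illustration
con.execute("ATTACH DATABASE 'facts.db' AS facts")

# Bulk-load rows into the attached, disk-based db, leaving memory free
con.execute("DROP TABLE IF EXISTS facts.sales")
con.execute("CREATE TABLE facts.sales (product TEXT, qty INTEGER)")
con.executemany("INSERT INTO facts.sales VALUES (?, ?)",
                [("widget", 10), ("gadget", 5)])
con.commit()

# SQL can now join freely across the in-memory and attached databases
total = con.execute("SELECT SUM(qty) FROM facts.sales").fetchone()[0]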
The ATTACHDBINMEMORY also attaches an external database, but this time loads it into memory, so any changes made will not be automatically persisted back to disk. To do that, use the SAVENAMEDDB command.
This requires an attached database alias, followed by the file name to save the database to. SAVEASNAMEDB has other uses such as making backups or making copies of data for use in an external debugging platform (it can be much easier to debug Python using a proper IDE).
Along side the facility to load data via CSV/TSV, I’ve also added an ADO command. This requires a valid ADO connection string, followed by either a table/view name or a SQL Select statement.
It uses ADODB 2.7, to enable handling of modern Access file formats, I’ll eventually make an ADO.NET version to remove this dependency.
Finally, I’ve managed to get a DBAPI compliant SQLite provider working in the PYTHON step. The provider is called XLite (it’s a modified version of Jeff Hardy’s SQLite3 library) and exposes most of the same functionality as CPython’s SQlite3 provider.
The library can open external SQLite databases, so offering another means of accessing non in-memory data and accesses the HAMMER default databases via the pre-connected pyDB variable. Having the ability to lazy-load rows via a cursor loop is also very useful in reducing memory foot-print when dealing with large tables. (see the IrelandFOIExample_hammer_test2.xlsx for an example of using XLite).
I’ve removed the dependency on a specific ADODB library (figured out how do equivalent of CreateObject in .NET for COM libraries). Also included is a first pass at a command line version (hammer.exe). Example:
hammer.exe mydata.csv CSV myotherdata.csv CSV JOIN > new.csv
hammer.exe inv.csv “CSV” “sum(qty)” REDUCE sum.csv TOCSV