SQLite – the ultimate data-smithing tool!

SQLite logo as of 2007-12-15Image via Wikipedia

Although my data-smithing tool box is full to the brim with powerful tools such as Talend, Kettle PDI, Picalo and Excel, all backed by the cloud infrastructure of Amazon’s S3, SImpleDB and EC2, there’s one simple yet powerful tool that I always seem to gravitate back to, that tool is SQLite.

Now obviously being a hewer of data, I need a SQL compliant database for data manipulation and SQLite performs that task with speed and ease. But it’s not just in the hewing, it’s in the hauling of data where SQLite also shines.

I use SQLite as the container for passing tabular datasets between (and within) my various tools, that data doesn’t even need to be clean (due to SQlite’s liberal manifest typing rules) just so long as it can be expressed as a table.

For example; a Talend job could store an extracted dataset in a SQLite file, pass that file on to a Python script for some special processing (for example extracting further data from a source not directly supported by Talend such as SAP or SimpleDB), and then pass the resulting SQLite database on to Excel or a similar tool to allow a business user to view and perhaps modify the data; finally Talend picking up the file again to load it into a corporate data warehouse.

Now you could use flat files to transport the data or store the intermediate results in a corporate database, but SQLite is as easy, if not easier than, flat files and offers the SQL processing capabilities of big-iron databases, but without the hassle of getting write access to an existing server or setting one up from scratch.

And I know there are other similar file based database offerings such as MS Access and the Java only HSQLDB, but neither match SQLite’s ubiquitousness, sheer simplicity and powerful data processing ability.

Advertisements

8 responses to “SQLite – the ultimate data-smithing tool!

  1. Pingback: Python the new VBA ? « Gobán Saor

  2. Pingback: SQLite and Excel as DataSmithing Tools | Chui’s counterpoint

  3. i’d be intested to see how you can connect excel to sqlite via odbc. it’s been impossible for me to get it to work..

  4. @devnull

    I don’t use ODBC myself for connecting SQLite to Excel but I’ve used this http://www.ch-werner.de/sqliteodbc/ in the past for connecting from Win32 based Java apps via a JDBC/ODBC bridge and it worked fine, including when I tested it from an Excel Query.

    Tom

  5. SQLite also has great support for BLOB which can be utilized for compressed serialized Python objects; see

    http://yserial.sourceforge.net

    Objects at that project are actually even more general: files and URL content — but the clearly one of the most useful is dictionaries which is essential for schema-less data. Python dictionaries can contain arbitrary objects, and such dictionaries can be accessed simply through SQLite.

    The SQLite module is part of the standard Python distribution > v2.5, so there is no need for further installation, configuration, separate server process, etc. The y_serial module uses only standard issue components, so everything that Tom mentioned becomes easily Python accessible.

    • Must give it a try, I particularly like the idea of SQLite as container for persisted dictionaries.

      Tom

    • You should also look at supporting APSW http://apsw.googlecode.com/svn/publish/couchdb.html the CouchDB via SQLite Virtual Table would offer interesting possibilities; using CouchDB as Web based master but SQLite as the local engine?

      In general the APSW approach to SQLite is much better that the standard Python library as it recognises that SQLite is different from server based rdbms.

      I use SQLite combined with Python, combined with Excel as my main datasmithing platform.

      Tom

      • hi Tom, thanks very much for your pointer to CouchDB — very helpful for those interested in NoSQL.

        That makes it very easy to import/export with SQLite, so hence with any arbitrary Python object via y_serial http://yserial.sourceforge.net

        y_serial is designed to encourage shard SQLite databases and using Python code to unify them as needed. CouchDB is more monolithic — so integration is an option when a project needs to scale big-time.