A few weeks ago I wrote a post on handling large XML datasets using MS’s SAX2 parser from within Excel. Although fast, the SAX2 parser is not as fast as the original of the species, the Expat Streaming XML parser; being written in C and with a minimalist approach Expat’s speed is to be expected and this speed can be a real time saver.
Being at heart a SQL Jockey, I also like to transfer XML data structures to relational ones at the earliest opportunity, especially when I’m in discovery mode. Small XML pretty-indented documents can be analysed by eye (wasn’t this the great selling-point, XML’s user-friendliness 😉 ), but when the dataset is large,ugly and deeply nested I prefer to load the data into a classic parent-child table such as this…
static const char* DDL = "create table FOO("
"fileName HIDDEN text,"
"xmlString HIDDEN text,"
"filter HIDDEN text,"
"LineNo HIDDEN int,"
"Offset HIDDEN int,"
"ItemNo HIDDEN int,"
As I found myself doing this repeatability I decided to code a SQLite Virtual Table which by wrapping the Expat parser would stream out the parsed data to populate such tables. The streaming data could either be consumed as it arrived or saved to a real (i.e. non-virtual) table for further analysis. In effect, creating a ‘relational DOM’, but unlike a DOM which is memory bound, it could exist as an in-memory table or as a disk-based one, making the handling and analysis of large datasets easier to achieve. The above snippet of code is the definition of the my VT_xmlPC virtual table. HIDDEN columns are a means of adding utility columns that don’t necessarily belong to the logical dataset, in this case fileName and xmlString are used to identify the XML to load.
If you’re interested in writing your own SQLite virtual tables this Dr. Dobb’s article Query Anything with SQLite offers a good introduction to the subject. Wrapping Expat adds some extra complexity as you have to deal with two sets of callbacks, SQLite’s row request events and Expat’s events, and also with nested Expat parsers; so if you’re new to virtual tables start with the Dr Dobb’s example before diving into the xmlPC.c code.
The resulting virtual table is capable of handling most XML ‘data documents’ as seen in the wild; as coded, it regards attributes as the equivalent of nested elements (with isAttribute = ‘Y’), and it doesn’t handle CDATA (but could easily be modified to so do).
To use it:
(a) load the xmlPC.dll extension
(b) create virtual table myTableName using VT_xmlPC
(c) select * from myTableName where fileName MATCH “C:\some xml file” or select * from myTableName where xmlString MATCH ” …”
The “matches” predicates filter prior to row creation, “=”s etc. filter after row is constructed. The “fileName”, “xmlString” and “filter” columns are “matches” only columns as they initiate the creation of the table (using “matches” rather than “=”s for such utility columns is my own convention not a requirement of SQlite virtual tables). The “filter” column will restrict loading to elements of that name and the children nodes of that element.
The source code link on the Dr. Dobbs article appears to be broken, the code can be found here (Note: this is an FTP link) in the 0711.zip file.
For another example of a SQLite virtual table (this time creating a “Tag Index” using FTS3) see http://blog.gobansaor.com/2010/11/12/tag-index-sqlite-star-query-part-iv/