CouchDB – document-centric ODS

While the potential of column-oriented DBMSs within BI projects is obvious, given the popularity of MOLAP (a form of column-oriented data store), the potential of the other new kid on the block, the document-oriented database, is less so. One such DBMS, CouchDB, is the latest wunderkind to bubble to the surface, helped by its RESTful interface, its abandonment of XML in favour of JSON, its use of JavaScript (replacing a bespoke language) as its “view” language, and its use of Erlang and MapReduce algorithms. (A CouchDB view is, as far as I can tell, like a combination of a function-based index and a materialized view.)
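To make the "function-based index plus materialized view" comparison a little more concrete, here is a minimal sketch of the map/reduce idea behind a view. It is written in Python purely for illustration (real CouchDB views are written in JavaScript), and the documents, `map_invoice` and `build_view` are hypothetical names of my own, not part of any CouchDB API.

```python
from collections import defaultdict

# Hypothetical source documents, as they might sit in a document store.
docs = [
    {"_id": "inv-1", "type": "invoice", "customer": "acme", "amount": 120.0},
    {"_id": "inv-2", "type": "invoice", "customer": "acme", "amount": 80.0},
    {"_id": "inv-3", "type": "invoice", "customer": "globex", "amount": 50.0},
    {"_id": "note-1", "type": "note", "text": "call back Tuesday"},
]


def map_invoice(doc):
    """Map step: emit (key, value) pairs; non-matching documents emit nothing."""
    if doc.get("type") == "invoice":
        yield doc["customer"], doc["amount"]


def build_view(documents, map_fn, reduce_fn):
    """Run the map step over every document, group by key, reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Keys are kept in sorted order, much as an index would keep them.
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


view = build_view(docs, map_invoice, sum)
print(view)  # {'acme': 200.0, 'globex': 50.0}
```

The map function plays the part of the function-based index (it decides which key each document is filed under), while the stored, reduced result plays the part of the materialized view.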

Where I see CouchDB’s place in a BI project is at the messy end (or should I say start) of the ETL pipe: the operational data store (ODS). Not an ODS in the high-church Inmon sense, i.e. not a normalised logical-data-model-made-real, but more an easily explorable source-data archive/audit facility. If all your data comes from one or two operational systems (e.g. ERP and CRM), the need for an ODS may not arise; simply use the operational systems themselves (or direct copies in a separate database), using conformed dimensions to provide the necessary glue. If, however, a large amount of your data comes not from traditional OLTP systems but from ‘document sources’, then something like CouchDB might come in useful.

Typical document sources might be: Excel spreadsheets, XML/JSON/CSV responses from SaaS APIs, scraped web pages, PDF/MS Word forms, MS Access or SQLite databases; even audio/video content (e.g. market research interviews with customers which are then “codified” and stored as customer dimension attributes).

You could of course use a traditional RDBMS to hold this information, especially if the database supports full-text search or has native support for semi-structured data; however, due to the huge amount of storage space that unstructured data can soak up, CouchDB’s open-source, Google-inspired MapReduce architecture, with its ability to scale out cheaply, might be more suitable. Given its alpha-level status, CouchDB is currently only suitable for testing or evaluation, but if you have a pressing need for such a scalable document store you could use Amazon’s S3. Although S3 is essentially just a key/value pair store, the value can be any blob of data you wish; it is in effect a massively scalable and keenly priced document-oriented data store.

Because S3 stores key/value pairs, the only indexing option is the key, and although meta-data tags can be associated with each pair, this data is not indexed for fast retrieval. The use of a local database to provide meta-data based filters/indices is the obvious solution; another, less obvious approach would be to use an online tagging service. Using such a service would of course raise privacy/security issues, but these could be mitigated by using the service’s privacy option and by using behind-the-firewall URLs, which could then be redirected to the correctly signed S3 URL via a LAN proxy.
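As a rough sketch of the "local database as meta-data index" option, the following uses Python's built-in sqlite3 module to map tags to keys. The `blob_store` dictionary is only a stand-in for the remote key/value store, and `put_document`, `keys_for_tags` and the example keys are hypothetical names chosen for illustration; nothing here calls S3 itself.

```python
import sqlite3

# Stand-in for the remote key/value store (assumption: key -> blob of bytes).
blob_store = {}

# Local index: one row per (tag, key) pair, held in an in-memory SQLite database.
index = sqlite3.connect(":memory:")
index.execute(
    "CREATE TABLE tags (tag TEXT NOT NULL, key TEXT NOT NULL, PRIMARY KEY (tag, key))"
)


def put_document(key, blob, tags):
    """Store the blob under its key and record its tags in the local index."""
    blob_store[key] = blob
    index.executemany(
        "INSERT OR IGNORE INTO tags (tag, key) VALUES (?, ?)",
        [(tag, key) for tag in tags],
    )
    index.commit()


def keys_for_tags(*tags):
    """Return, in sorted order, the keys that carry every one of the given tags."""
    placeholders = ",".join("?" for _ in tags)
    rows = index.execute(
        f"SELECT key FROM tags WHERE tag IN ({placeholders}) "
        "GROUP BY key HAVING COUNT(DISTINCT tag) = ? ORDER BY key",
        (*tags, len(set(tags))),
    )
    return [row[0] for row in rows]


put_document("crm/2008/survey-01.xls", b"...", ["crm", "survey", "2008"])
put_document("erp/2008/invoices.csv", b"...", ["erp", "2008"])

print(keys_for_tags("2008"))         # both keys
print(keys_for_tags("crm", "2008"))  # only the CRM survey
```

The filter runs entirely against the local index; only the keys it returns would then need to be fetched from the remote store.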