Monday, January 11, 2016

Big Data Warehouse with Cassandra & Spark


Enterprise data warehousing (EDW) has traditionally been the realm of big-iron databases from vendors such as Oracle and IBM, and of vertical storage engines such as Teradata. With the rapid evolution of Big Data in the past few years, the market has begun to shift away from monolithic, highly structured data storage engines that lack inherent support for the tenets of Big Data.

While data warehousing (DW) design has traditionally implied denormalization and a focus on data structures that are more in tune with the applications using them (sounds a bit like the NoSQL philosophy, doesn't it?), many Big Data storage options and NoSQL databases lack, at least out of the box, the ad-hoc querying and analytics capabilities required to support a data warehousing solution.

Enter Cassandra and Spark. Together, these two products let you build your own robust and flexible data warehousing and analytics solution while running on top of a Big Data-centric compute and storage grid. They complement each other: Cassandra provides flexible data storage, and Spark provides rich query and analytics processing.

Cassandra is widely known in the industry for its modular scaling and its built-in partitioning and replication. Cassandra's query interface (CQL) has some of the benefits of SQL while allowing for the NoSQL benefits of semi-structured data, wide-column scaling and sparse-row capabilities. But with many of Cassandra's powerful NoSQL features come inherent limitations, such as a limited ability to perform aggregation operations and rich analytics functions within Cassandra. And as with most NoSQL (non-relational) storage engines, joining tables is not something Cassandra offers. These are significant gaps when building a data warehouse.


This is where Spark and its integration with Cassandra fill the feature gap, giving Cassandra the capabilities necessary for a fully capable data warehousing platform. Spark's data management capabilities via RDDs (Resilient Distributed Datasets) and Spark's powerful distributed compute fabric combine to provide a robust and highly scalable storage and analytics data warehousing solution.
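
To make the gap concrete, here is a minimal plain-Python sketch of the kind of work the compute layer takes on: joining two tables and then aggregating the result. It deliberately does not use the Spark or Cassandra APIs. The `customers` and `orders` rows, their column names, and the `revenue_by_region` helper are all hypothetical, standing in for rows read separately from two tables.

```python
from collections import defaultdict

# Hypothetical rows, shaped like the results of two separate per-table reads.
customers = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
    {"customer_id": 3, "region": "EMEA"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 120.0},
    {"order_id": 11, "customer_id": 2, "amount": 80.0},
    {"order_id": 12, "customer_id": 1, "amount": 40.0},
    {"order_id": 13, "customer_id": 3, "amount": 60.0},
]


def revenue_by_region(customers, orders):
    """Join orders to customers on customer_id, then sum amount per region."""
    # Build side of a hash join: index the customer table by the join key.
    region_of = {c["customer_id"]: c["region"] for c in customers}

    totals = defaultdict(float)
    for order in orders:
        region = region_of.get(order["customer_id"])
        if region is None:
            continue  # inner join: skip orders with no matching customer
        totals[region] += order["amount"]  # group-by aggregation
    return dict(totals)


result = revenue_by_region(customers, orders)
print(result)
```

In a real deployment the same join-then-aggregate logic would be expressed against Spark's distributed datasets rather than in-memory lists, so that the work is spread across the cluster instead of running in a single process.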

One of the big benefits of building your DW solution on Cassandra and Spark is that you get all the benefits of Big Data scaling (compute and storage) while running on commodity hardware and leveraging Spark's elegant programming interfaces (Scala, Java, Python, R). And with Spark you have room to build machine learning and other deep analytics on your data, without the lock-in and limitations of legacy big-iron data warehousing engines.

Roll up your sleeves and start your own journey to build your next Big Data warehouse using Spark and Cassandra.