Wednesday, December 2, 2015

No Compromise Database with NoSQL & Apache Spark


Database technology has been going through a renaissance over the past several years. Relational databases have matured steadily over the past couple of decades, but they were not well equipped to deal with the data volume, velocity and variety (the three Vs) now demanded by the world of social apps, mobile, IoT, and Big Data, just to name a few.

We are now seeing many new database engines coming to the market (commercial and open source) geared to serving particular application domains and functional verticals. There is some awesome innovation happening, but the common theme you see with the vast majority of these databases is that they give up something from the traditional relational database world to hit, for example, the CAP theorem sweet spot they are aiming for or the volume, scalability, or throughput they are trying to achieve.

The most common tradeoff made by many of the NoSQL database engines is the elimination of table or entity joins. Joining data sets is a fundamental part of the relational model: it allows data to be modeled using a normalization approach, with a schema that can serve multiple application scenarios. The approach is different with NoSQL databases. When designing a NoSQL database schema, the modeling of the schema and data (or the lack of a schema, or a less rigid one) is very tightly coupled to how the applications will use it. So NoSQL databases trade away the strong typing of the relational world and push more complexity to the application tier.


The fact that joins are missing from many of the popular NoSQL engines (Cassandra, MongoDB...) puts more complexity on the application tier, which now has to provide functionality such as combining and mashing different data sources together. For example, joining two data sets pulled from two different tables or storage engines can be complex and hard to scale in the application tier. Enter Apache Spark. With Spark, application developers can use Spark's grid computing capabilities to perform database-engine-style operations without reinventing the wheel in the application layer, while at the same time leveraging a highly scalable compute and memory management grid with rich built-in data transformation operations (RDDs, map/reduce, filters, joins...).

Combining Apache Spark with your backend application services is a powerful way to scale NoSQL databases, because it allows rich data operations across multiple tables, documents and polyglot data sources. And this can be done while leveraging Spark's rich and expressive APIs and its highly scalable processing and memory caching.

So Spark is not just for petabyte-scale Big Data number crunching and machine learning tasks. You can use Spark in your traditional data management tier to join disparate data entities and to perform the rich data processing operations typically provided by relational databases. With Spark you get the benefits of NoSQL without compromise.

Embed Spark into your backend application tier and give it a spin. It will change how you build backend services forever.