SQL and MPP: The Next Phase in Big Data

Over the past couple of years we have all heard about the Big Data movement. Two key enablers in this remaking of analytics, data warehousing and general computing have been the NoSQL database movement and the emerging Hadoop compute stack. While not directly related to each other, both NoSQL and Hadoop have become associated with the rapidly accelerating Big Data revolution as more companies look to manage larger and larger data sets more effectively and economically. NoSQL has been the new kid on the block in the database space, attempting to take applications and data to the promised land of web-scale computing where traditional relational databases have fallen short. Over the past decade, SQL and relational database technology have failed to keep up with developer needs and the scaling demands of a new generation of social and data-heavy applications, and this has opened the door to a different approach from the NoSQL camp. Hadoop is in a similar position, promising to deliver analytics and offline batch computing power that was not practical or cost effective before the emergence of HDFS and Map Reduce, and that until now has for the most part been found only in expensive, proprietary analytics products.

As with many proclaimed revolutions, what we have seen of Big Data so far is just the tip of the iceberg. As with most transformations in technology, there is more to come as these technologies penetrate more industries and gain wider adoption and broader acceptance by the open source community and the established heavy hitters. The next wave of Big Data technology will push into other domains and go beyond the offline computing boundaries of HDFS and Map Reduce. While SQL and relational database centered analytics have taken a back seat lately because of the emergence of NoSQL, SQL as a domain language will get a lift from the next Big Data wave as we move past the basic offline Map Reduce paradigm and look towards more real-time computing engines that enable MPP (massively parallel processing), where a query is split across many nodes that each work on their own slice of the data in parallel. This will allow IT organizations to continue to benefit from the low cost of commodity hardware and the horizontal scaling brought about by Map Reduce and HDFS, now generalized further for real-time analytics.
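
To make the MPP idea concrete, here is a minimal sketch of the scatter-gather pattern behind it. It is not how any particular product works: the data, the function names and the use of threads to stand in for separate nodes are all assumptions made for illustration. Each "node" aggregates only its own partition, and a coordinator merges the partial results, roughly what an engine would do for SELECT region, SUM(amount) FROM sales GROUP BY region.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: (region, amount) rows, already split into partitions
# as if each partition lived on a different node.
PARTITIONS = [
    [("east", 10), ("west", 5), ("east", 7)],
    [("west", 3), ("north", 8)],
    [("east", 1), ("north", 2), ("west", 4)],
]


def partial_sum(rows):
    """Work done on one 'node': aggregate only the local partition."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def mpp_sum_by_region(partitions):
    """Coordinator: scatter the work, then gather and merge the partials."""
    # Threads stand in for nodes here; a real MPP engine would ship the
    # query fragment to separate machines that hold the data.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_sum, partitions))

    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(merged)


result = mpp_sum_by_region(PARTITIONS)
print(result)
```

The point of the design is that the expensive scan happens next to the data, and only small partial aggregates travel to the coordinator, which is what lets this style of engine scale out on commodity hardware.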

While NoSQL has established itself as a technology that is here to stay, the traditional relational database paradigm is not gone by any stretch and still provides an invaluable ad hoc query function to analysts and developers alike. NoSQL products like Cassandra, HBase and MongoDB (to mention a few) solve a unique problem and are becoming key foundations of any web-scale computing stack, whether for online CRUD apps or for offline analytics. But that does not eliminate the need for relational SQL engines or diminish the power of SQL as an expressive domain language. NoSQL is not a silver bullet, but it can be a powerful complement to traditional relational data storage models. The NoSQL folks have made a classic engineering trade-off, giving up certain features found in relational databases to gain greater horizontal scalability. I will not get into the details here, and I do not want to oversimplify what NoSQL has done, but at the heart of the trade-off is eliminating relationships between data entities so that data can be spread across many machines more easily. NoSQL also gives the developer a more flexible schema-on-read model, where structure is interpreted when the data is read rather than enforced when it is written, and that has its benefits.
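
A small sketch may help show the trade-off described above. It uses SQLite as a stand-in for a relational engine and plain JSON strings as a stand-in for a document store; the tables, field names and values are made up for illustration and do not reflect any specific NoSQL product.

```python
import json
import sqlite3

# Relational side: normalized tables linked by a key, queried ad hoc with a join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        total REAL
    );
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 7.5);
""")
sql_totals = dict(conn.execute(
    "SELECT u.name, SUM(o.total) "
    "FROM users u JOIN orders o ON o.user_id = u.id "
    "GROUP BY u.name"
).fetchall())

# Document side: each record is self-contained (orders are embedded), so no
# join is needed and records can be placed on different machines independently.
# Note the second document carries an extra field nobody declared up front.
raw_docs = [
    '{"name": "ann", "orders": [{"total": 20.0}, {"total": 5.0}]}',
    '{"name": "bob", "orders": [{"total": 7.5}], "twitter": "@bob"}',
]


def read_total(doc):
    """Schema-on-read: structure is interpreted here, at read time."""
    return sum(order.get("total", 0.0) for order in doc.get("orders", []))


doc_totals = {doc["name"]: read_total(doc) for doc in map(json.loads, raw_docs)}

print(sql_totals)
print(doc_totals)
```

Both sides arrive at the same totals, but the relational version can answer new questions with a new query, while the document version answers them only by writing new read-side code against whatever shape the records happen to have.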

So what does this all mean? Well, expect NoSQL and the current 1.0 Hadoop stack to continue to mature and become more mainstream; that is a no-brainer. But for the next phase I see SQL (for ad hoc querying) and real-time MPP becoming part of this Big Data fabric. This will bring back the ad hoc capabilities of relational databases, but now with the horizontal scaling and cost effectiveness found in HDFS and Map Reduce.

You can see this next phase already happening by observing all the commercial products rushing to extend their traditional analytics engines to work on top of Hadoop, and all the investment going into taking Hadoop beyond its current offline Map Reduce roots. The players vary from open source next generation MPP platforms, to cloud providers offering analytics as a service, to traditional data warehouse vendors extending their products to run on top of Hadoop, to next generation relational database start-ups. Here is a sample of some of the players and products to watch:

Hadoop 2.0 Players

Cloudera - Impala
Cloudera is leading the charge to create a next generation open source MPP platform that builds on the core components of Hadoop (HDFS, Zookeeper, MR, etc.) to enable real-time analytics of Big Data. The initiative is open source but primarily driven (at least for now) by Cloudera. It is also partly a recognition that Map Reduce and tools like Hive are fine for certain offline analytics and processing but are not a complete solution for real-time reporting and analytics.

MapR - Apache Drill
This is a similar project to Impala but channeled through the Apache organization and primarily driven by MapR (a Cloudera competitor in the Hadoop space).

Hadapt
A vertical solution for organizations wanting a more SQL-friendly interface to their Hadoop data sources.

Datameer
Another Hadoop vertical player that is trying to make analytics and reporting easier for the Hadoop stack.

Cloud Players

Google - BigQuery
This is Google's cloud service that combines a distributed data store with a powerful SQL-like ad hoc query engine (based on Google's Dremel query system).

Amazon - Redshift
An Amazon service that helps businesses build data warehouses in the cloud more economically, with an ad hoc SQL query interface. Partially based on technology from ParAccel.

Old School Players

IBM - Netezza
While traditionally focused on enterprise data warehousing, IBM is evolving their stack to fit and play nice with Hadoop and other Big Data solutions.

HP - Vertica
HP's Big Data play. Like IBM and Teradata, HP acquired its way into the Big Data space.

Teradata - Aster Data
Teradata is a true old school player, dating from the time when large-scale data centered only on relational databases. Its acquisition of Aster Data changed that.

Next Generation SQL Players to Watch

NuoDB
NuoDB is the new kid on the block, promising a new way to scale and build relational databases in the cloud. Their approach is more or less based on a peer-to-peer model that allows them to scale out (as they claim) while still delivering on the traditional capabilities of relational databases, such as read consistency and ACID transactions. While NuoDB is more focused on OLTP-type processing, its claim that it can scale horizontally while supporting a SQL relational model makes it potentially powerful for real-time analytics as well.

VoltDB
Another new age relational database engine that delivers horizontal scaling while retaining SQL capabilities. It differs from NuoDB by taking an in-memory approach to meet the scaling challenge.

For the next wave of Big Data innovation, the landscape is rapidly changing, with both old and new industry players getting into the game. Big Data will no longer be limited to offline, long-latency analytics processing. The lines between OLTP, OLAP and Enterprise Data Warehousing are blurring as offline computing, real-time analytics and data storage models evolve and converge. Expect better technology options and improved cloud scalability at a lower cost of ownership as the competition heats up and the next evolution of Big Data matures. Pick a horse and run with it. Stay tuned.

Tuesday, March 5, 2013
