Wednesday, November 18, 2015

Understanding Apache Spark - Why It Matters


Apache Spark has come onto the scene in the past few years and taken the computing world by storm. It is often dubbed the replacement for Hadoop and widely seen as the next evolution in Big Data. Spark is one of the most active Apache projects and has developed a strong ecosystem. Even the established Big Data players are adopting it in their stacks and positioning it as a key component of their open source and productized solutions.

Why has Spark been so successful? How is it better than, or different from, the first incarnation of Big Data (aka Hadoop)? Spark does not abandon the principles realized by Hadoop and the companies that brought the Big Data philosophy to the masses. It builds on the same foundations, storage layers such as HDFS and programming constructs such as MapReduce, but does so in a way that makes building applications on top of Spark far more efficient and effective than on its predecessors.

Spark, like Hadoop, supports building a computing fabric that can be deployed on commodity hardware and inherently supports horizontal scaling. Spark lowers the barrier for application developers to parallelize their applications and spread the computation and data access across a cluster of machines for processing. Hadoop does many of the same things, but Spark does them better, both in its technology implementation (more efficient use of memory, better garbage collection handling, and so on) and in its much better application programming API.
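
To make that concrete, here is a minimal sketch in Scala. It is an illustration under assumptions, not production code: the "local[4]" master URL simply simulates a four-core cluster on a single machine. It shows Spark distributing an ordinary collection across workers and aggregating a result in parallel:

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelSumSketch {
      def main(args: Array[String]): Unit = {
        // "local[4]" runs on four local cores; on a real cluster this would be
        // the cluster master URL (e.g. spark://host:7077), with the application
        // logic left completely unchanged.
        val conf = new SparkConf().setAppName("ParallelSumSketch").setMaster("local[4]")
        val sc = new SparkContext(conf)

        // Partition the data across the cluster; Spark schedules the map and
        // reduce work on the executors and collects the final result.
        val sumOfSquares = sc.parallelize(1L to 1000000L, numSlices = 8)
          .map(n => n * n)
          .reduce(_ + _)

        println(s"Sum of squares: $sumOfSquares")
        sc.stop()
      }
    }

That is the horizontal-scaling story in miniature: the same program scales out by pointing the master at a real cluster, with no change to the application code.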



What Spark does is raise the bar from a programming interface perspective. It has strong support for Java, Scala, Python, and R. Its core abstractions for managing data (such as RDDs) and computation are well-designed interfaces and APIs. When working with Spark you still have to look at your application and the problem you are trying to solve and think about how to parallelize it, but the Spark APIs are intuitive for the typical application programmer to understand and use. Spark gives you the tools to access essentially the same power a grid computing platform or a distributed database engine has internally, and makes it available to the average programmer to embed that same sophistication in their own applications.
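
As an illustration of how concise the RDD API is, here is the canonical word count in Scala (the HDFS input path below is just a placeholder). The same computation takes dozens of lines of boilerplate in classic Hadoop MapReduce:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCountSketch")
        val sc = new SparkContext(conf)

        // textFile reads from HDFS (or local files); the path is a placeholder.
        val lines = sc.textFile("hdfs:///data/sample.txt")

        // flatMap, map, and reduceByKey are lazy transformations; nothing runs
        // until an action (take, below) forces the computation.
        val counts = lines
          .flatMap(line => line.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }

The map and reduceByKey steps are the familiar Map-Reduce constructs, but expressed as ordinary method calls on a collection-like object rather than as framework plumbing.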

Spark is a game changer. It can be used for everything from ETL, to basic OLTP computations that drive a GUI, to backend batch processing, to real-time streaming applications and graph modeling. It will bring some of the powerful distributed computing technology pioneered by the internet giants into applications at every level of the enterprise. Strap on your boots and start learning Spark. It is the next evolution not just in Big Data but in general purpose application programming, one that takes true distributed grid computing and brings it to the programming masses.
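
As one sketch of that breadth, the batch word count above needs only small changes to become a real-time streaming job. The socket source on localhost:9999 is an assumption for demonstration (it could be fed with "nc -lk 9999"); a real deployment would read from a source like Kafka or Flume:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCountSketch").setMaster("local[2]")
        // Process the incoming stream in five-second micro-batches.
        val ssc = new StreamingContext(conf, Seconds(5))

        // Hypothetical source: text lines arriving on a local socket.
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines
          .flatMap(line => line.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The transformation pipeline is nearly identical to the batch version, which is exactly the point: one set of concepts and one API family span ETL, batch, and streaming workloads.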