Tuesday, February 2, 2016

Spark Processing for Low Latency Interactive Applications

Apache Spark is typically thought of as a replacement for Hadoop MapReduce for batch job processing. While it is true that Spark is often used for large-scale, compute-intensive distributed batch jobs, it can also be used for low-latency operations in more interactive applications.

Note that this is different from Spark Streaming and micro-batching. What we are talking about here is using Spark's traditional batch, memory-centric MapReduce functionality and its powerful Scala (or Java/Python/R) APIs for low-latency, short-duration interactive processing via REST APIs integrated directly into application code.

The Spark processing API is powerful and expressive for rich processing, and the Spark compute engine is efficient at optimizing data processing and access to memory and workers/executors. Leveraging this in your interactive CRUD applications can be a boon for application developers. Spark makes this possible with a number of capabilities available to developers once you have tuned your Spark cluster for this type of computing scenario.

First, latency can be reduced by caching Spark contexts and even caching (when appropriate) RDDs. Spark Job Server, an open source Spark-related project, lets you manage a pool of Spark contexts, essentially creating cached connections to a running Spark cluster. By leveraging Job Server's cached Spark contexts and REST API, application developers can access Spark with lower latency and share resources and processing on the Spark cluster across multiple users. Another interesting project that can be useful for interactive applications is Apache Toree - check it out as well.
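To make this concrete, here is a minimal sketch of a job written against the classic spark-jobserver SparkJob trait (exact package names and signatures may vary by Job Server version); the job itself and its input.string parameter are made up for illustration:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import scala.util.Try
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

// A small job that counts words in a string passed in via the REST request's config.
// Because Job Server reuses a pooled, already-running SparkContext, the per-request
// cost is only the job itself, not context startup.
object WordCountJob extends SparkJob {

  // Validate the request cheaply before running, rejecting it if input is missing.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    Try(config.getString("input.string"))
      .map(_ => SparkJobValid)
      .getOrElse(SparkJobInvalid("Missing 'input.string' parameter"))

  // The actual work; the returned value is serialized into the REST response.
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq)
      .countByValue()
}

Once the jar is uploaded to Job Server, a synchronous REST call that names this class and a pooled context runs the job on the cached SparkContext and returns the result directly in the HTTP response, which is what makes the round trip short enough for interactive use.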

Secondly, you can set up a standalone Spark cluster adjacent to your traditional application server cluster (a Tomcat servlet engine cluster, for example) that is optimized for handling concurrent application requests. Spark has a number of configuration options that allow a Spark cluster to be tuned for concurrent, short-duration job processing. This can be done by sharing Spark contexts as described above, by using the Spark fair scheduler, and by tuning RDD partition sizing for the given set of worker executors so that partition shuffling is kept to a minimum. You can learn more from this video presentation on optimizing Job Server for low-latency and shared concurrent processing.
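As a rough illustration, a long-lived shared context tuned for short concurrent jobs might be configured along these lines; the master URL, core and memory caps, and parallelism value are placeholders you would size for your own cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Configuration for a long-lived, shared context serving short interactive jobs.
// The master URL and sizing values below are illustrative placeholders.
val conf = new SparkConf()
  .setAppName("interactive-crud-context")
  .setMaster("spark://spark-master:7077")   // standalone cluster adjacent to the app servers
  .set("spark.scheduler.mode", "FAIR")      // fair scheduler so concurrent jobs share executors
  .set("spark.cores.max", "4")              // cap cores so several contexts can coexist on the cluster
  .set("spark.executor.memory", "2g")
  .set("spark.default.parallelism", "8")    // small partition counts keep shuffles cheap

val sc = new SparkContext(conf)

Capping cores and keeping default parallelism low may seem counterintuitive for Spark, but for many small concurrent requests it avoids oversized shuffles and lets the fair scheduler interleave jobs rather than letting one request monopolize the cluster.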

By tuning a multi-user friendly Spark cluster in this way, application developers are free to leverage Spark's powerful Scala, Java, Python and R APIs in ways not previously available to traditional application developers. With this capability you can enhance traditional CRUD application development with low-latency MapReduce-style functionality and create applications that were not possible before.


With this type of architecture, where your traditional application servers use an interactive, low-latency Spark cluster via a REST API, you can integrate a variety of data sources and data/analytics services together using Spark. You can, for example, mash up data from your relational database and Cassandra or MongoDB to create combined processing that would be hard to do with hand-written application code. This approach opens up a bountiful world of powerful Spark APIs to application developers. Keep in mind, of course, that if your Spark operations require execution across a large set of workers/nodes and RDD partitions, response times will likely suffer. But any operation with a reasonable number of stages, and that can be configured to run on an RDD with one or a few partitions, has the potential to fit this scenario - again, something for you as the developer to quantify.
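Here is a sketch of what such a mashup could look like using the Spark 1.x DataFrame API; it assumes the JDBC driver and the spark-cassandra-connector are on the classpath, and the table and column names (customers, orders, customer_id, amount) are invented for the example:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

val sqlContext = new SQLContext(sc)

// Hypothetical "customers" table read from a relational database via the JDBC data source
val customers = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/appdb")
  .option("dbtable", "customers")
  .option("user", "app")
  .option("password", "secret")
  .load()

// Hypothetical "orders" table read from Cassandra via the spark-cassandra-connector data source
val orders = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "sales", "table" -> "orders"))
  .load()

// Join across the two stores and compute per-customer order totals in a single Spark job
val orderTotals = customers
  .join(orders, customers("id") === orders("customer_id"))
  .groupBy(customers("name"))
  .agg(sum(orders("amount")).as("total_spent"))

orderTotals.show()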

Running a Spark cluster tuned for servicing interactive CRUD applications is achievable and is one of the next frontiers that Spark is opening up for application developers. It opens the door to data integrations and no-ETL computing that were not feasible or imaginable in the past. Meshing data from multiple data stores and leveraging Spark's powerful processing APIs is now accessible to application developers and no longer the realm of back-end batch processing developers. Get started today. Stand up a Spark cluster, tune it for low-latency processing, set up Job Server, and then create some amazing interactive services!
