Tuesday, February 2, 2016

Spark Processing for Low Latency Interactive Applications

Apache is typically thought of as a replacement for Hadoop MapReduce for batch job processing. While it is true that Spark is often used for efficient large scale distributed cluster type processing for compute intensive jobs, it can also be used for processing low latency operations used in more interactive applications.

Note this is different than Spark Streaming and micro-batching. What we are talking about here is using Spark's traditional batch memory centric MapReduce functionality and powerful Scala (or Java/Python/R APIs) for low-latency and short duration interactive type processing via REST APIs integrated directly into application code.

The Spark processing API is very powerful and expressive for doing rich processing and the Spark compute engine is efficient at optimizing data processing and access to memory and workers/executors. Leveraging this in your interactive CRUD applications can be a boon for application developers. Spark makes this possible with a number of capabilities available to developers once you have tuned your Spark cluster for this type of computing scenario.

First, latency can be reduced by caching Spark contexts and even caching (when appropriate) RDDs. The Job Server open source project, is a Spark related project that allows you to manage a pool of Spark contexts that essentially creates cached connections to a running Spark cluster. By leveraging Job Server's cached Spark contexts and REST API, application developers can access Spark with lower latency and enable access to multi-user shared resources and processing on the Spark cluster. Another interesting project that can useful for interactive applications is Apache Toree - check it out as well. 

Secondly, you can setup a Standalone Spark cluster adjacent to your traditional application server cluster (tomcat servlet engine cluster for example) that is optimized for handling concurrent application requests. Spark has a number of configuration options that allow a Spark cluster to be tuned for concurrent short duration job processing. This can be done by sharing Spark Contexts as described and by using the Spark fair scheduler and tuning RDD partition sizing for the given set of worker executions that keep partition shuffling to a minimum. You can learn more from this video presentation on optimizing Job Server for low-latency and shared concurrent processing.

By leveraging and tuning a multi-user friendly Spark cluster, this frees application developers to leverage Spark's powerful Scala, Java, Python and R API's in ways not available in the past to traditional application developers. With this capability you can enhance traditional CRUD application development with low-latency MapReduce type of functionality to create applications not imaginable before to developers.

With this type of architecture where your traditional application servers are using an interactive low-latency Spark cluster via a REST API, you can integrate a variety of data sources and data/analytics services together using Spark. You can, for example, mash up data from your relational database and Cassandra or MongoDB to create processing and data mashup you could not do easily with hand written application code. This approach opens up a bountiful world of powerful Spark APIs to application developers. Keep in mind of course that if your Spark operations require execution on a large set of workers/nodes and RDD partitions, this will likely not lead to very good response times. But any operation with a reasonable number of stages and that can be configured to process on one or a few partition RDDs has the potential to fit this scenario, but again something for you as the developer to quantify.

Running a Spark cluster tuned for servicing interactive CRUD applications is achievable and one of the next frontiers that Spark is opening up for application developers. This will open the door for data integrations and no-ETL computing that was not feasible or imaginable in the past. Meshing data from multiple data stores and leveraging Sparks powerful processing APIs is now accesable to application developers and no longer the realm of backend batch processing developers. Get started today. Standup a Spark cluster, tune it up for low-latency processing, setup Job Server and then create some amazing interactive services!

Monday, February 1, 2016

Temporal Database Design with NoSQL

Tracking data as it changes over time in a database is a common requirement for many applications and data warehousing systems. Knowing when a data element or group of elements have changed and over what period of time the data is valid over, is often a required feature in many applications and analytical systems.

Supporting this type of time management functionality using a traditional RDBMS such as MySQL or Oracle is well understood by data modelers and programmers. Such temporal data modeling can be done in a variety of ways in relational database for both OLTP and OLAP style applications. For example, Oracle and IBM DB2 have built-in extensions for managing bitemporal time at the table and schema level. It is also possible to roll your own solution with any of the major RDBMS engines by applying time dimension columns appropriately to your schema and then with the appropriate DML manage the updating and insertion of new change records. To do this precisely and 100% consistently the database is required to support durable ACID transactions, something most RDBMS have in spades. See wikipedia links for background on temporal database models.

Now this is all great, temporal and bitemporal table design is a well understood concept by data architects in the RDBMS world. Now how do you do this if you are on the Big Data and NoSql bandwagon? To begin with most NoSQL databases lack support for ACID transactions, which is prerequisite for handling temporal operations on slow changing dimensions (temporal data) and bitemporal dimensions (valid time dimension and transaction time dimension). ACID transactions are required in order to properly mark expired records as new records as new records are being appended. Records must never overlap.

NoSQL databases such as Cassandra and Couchbase are powerful database engines that can be leveraged for a wide segment of data processing and storage needs. NoSQL databases offer many benefits including built in distributed storage/processing, flexible schema modeling and efficient sparse data management. Many of these benefits come at a price that limit NoSQL database applicability in cases where durable ACID transactions are required for scenarios such as managing slow changing time dimension record management.

To address this functionality gap, NoSQL and traditional ACID databases can be combined in a data serialization and deserialization API with a schema design pattern that provides a polyglot database framework that supports temporal and bitemporal data modeling and provides a data access and query API. So leverage your favorite NoSQL database without compromise. This approach can be applied to NoSQL databases in both OLTP, data warehousing and Big Data scenarios. Get your polyglot engines going, your favorite NoSQL database just got bitemporal!

Monday, January 11, 2016

Big Data Warehouse with Cassandra & Spark

Enterprise Data warehousing (EDW) has traditionally been the realm of big iron databases such as Oracle, IBM and other vertical storage engines such as Teradata. With the rapid evolution of Big Data in the past few year, the market has begun to shift away from monolithic and highly structured data storage engines that lack inherent support for the tenants of Big Data.

While data warehousing (DW) design has traditionally implied denormalization and focusing on data structures that are more in tune with the applications using it (sounds a bit like NoSQL philosophy don't it), many of the Big Data storage options and NoSQL databases lack some of the needed functionality (at least out of the box) to allow for the needed ad-hoc querying capabilities and analytics required to support a data warehousing solution.

Enter into the picture Cassandra and Spark. These are two products that together can allow you to build your own robust and flexible data warehousing and analytics solution,  and doing this while running on top of a big data centric compute and storage grid environment. Together Cassandra and Spark complement each other to allow for flexible data storage and rich query and analytics processing and computing.

Cassandra is widely known in the industry for its modular scaling, built-in partitioning and replication. Cassandra's query interface (CQL), has some of the benefits of SQL while allowing for the benefits of NoSQL semi-structure data and wide column scaling and sparse row capablites. But with many of Cassandra's powerful NoSQL features come inherent limitations such as the ability perform aggregations operations and rich analytics functions within Cassandra. And as with all NoSQL (non relational) storage engines, joining tables is not something offered by Cassandra. These are significant gaps to building a data warehouse.

This is where Spark and Spark's integration with Cassandra fills the feature gap needed for Cassandra to deliver the capablilies necessary for a fully capably data warehousing platform. Spark's data management capabilities via RDDs (Resilient Distributed Datasets) and Sparks powerful distributed compute fabric combine to provide the ability to build a robust and highly scalable storage and analytics data warehousing solution.

One of the big benefits of building your DW solution on Cassandra and Spark is you get all the benefits of Big Data scaling (compute and storage scaling) while running on commodity hardware and while leveraging Spark's elegant programing interfaces (Scala, Java, Python, R). And with Spark you have room to build machine learning and other deep analytics on your data and without the lock-in and limitations of legacy big iron data warehousing engines.

Rollup your selves and start your own journey to build your next Big Data Warehouse using Spark and Cassandra.

Wednesday, December 2, 2015

No Compromise Database with NoSQL & Apache Spark

Database technology has been going through a renaissance over the past several years. Relational databases have matured steadily over the past couple of decades, however relational databases were not well equipped to deal with the data volume, velocity and variety (three Vs) that is now demanded by the world of social apps, mobile, IoT, and Big Data - just to name a few.

We are now seeing many new database engines coming to the market (commercial and open source) geared to servicing paritcular applications domains and functional verticals. There is some awsome innovation happening, but the common theme you see with the vast majority of these databases is that they give up something from the traditional relational database world to achieve the level of, for example, CAP theorem suite spot they are aiming for or volume/scalability/throughput they are trying to achieve.

The most common tradeoff given up by many of the NoSQL database engines, for example, is the elimination of table or entity joining. Joining data sets is a fundamental part of the relational model that allows for modeling data using a normalization approach and having a schema that can server multiple application scenarios. This approach is different with NoSQL database. When designing a NoSQL database schema the modeling of the schema/data (or lack of schema - less rigid schema) is very tightly coupled with how the applications will use the schema. So NoSQL databases tradeoff the strong typing of the relation world but push more complexity to the application tier.

The fact that joining is missing from many of the popular NoSQL engines (Cassandra, MongoDB...) puts more complexity on the application tier to help offer functionality such as combining and mashing different data sources together. For example, trying to do a join between to data sets pulled from two different tables or storage engines can be complex and hard to scale in the application tier. Enter Apache Spark into the picture. With Spark, application developers can use Spark's grid computing capabilities to perform database engine type operations without reinventing the wheel in the application layer and while at the same time leveraging a highly scalable compute grid and memory management grid with built-in rich data transformation operations (RDDs, map/reduce, filters,  joins...).

Combining Apache Spark with your backend application services is a powerful way to scale NoSQL databases by allowing for rich data operations across multiple tables, documents and polyglot data sources. And this can be done while leveraging Sparks very rich and expressive APIs and highly scalable processing and memory caching.

So Spark is not just for petabyte scale Big Data number crunching and machine learning tasks. You can use Spark in your traditional data management tier to join desperate data entities and use it for rich data processing operations typically provided by relational databases. With Spark you get the benefits of NoSQL without compromise.

Embed Spark into your backend application tier and give Apache Spark a spin, it will change how you build backend services forever.

Wednesday, November 18, 2015

Understanding Apache Spark - Why it Matters

Apache Spark has come on the scene in the past few years and has taken the computing world by storm. It is dubbed as the replacement for Hadoop and often seen as the next evolution in Big Data. Spark is one of the most active Apache projects and has developed a strong ecosystem. Even the Big Data players themselves are adopting it in their stack and positioning it as a key player in their overall open source and productized solutions.

Why has Spark been so successful? How is it better or different than the first incarnation of Big Data (aka Hadoop). Well Spark does not abandon the principles that were realized by Hadoop and companies that helped bring the Big Data philosophy to the masses. Spark builds on the basic building blocks of such technologies, such as HDFS and programming constructs such as Map-Reduce and it does it in a way that makes building application on top of Spark much more efficient and effective than its predecessors.

Spark like Hadoop supports building a computing fabric that can be deployed and can run a commodity type hardware and inherently supports horizontal scaling. Spark lowers the barriers for helping application developers parallelizable their applications and spreading the computing and data access on a cluster of computers for processing. Hadoop does many of the same thing, but Spark does it better from both a technology implementation perspective (more efficient use of memory, garbage collection handling...) and much better application programming API.

What Spark does is raise the bar from a programming interface perspective. It has strong support for Java, Scala, Python and R. Its core operations for managing data (such as RDDs) and computing are very well designed interfaces and APIs. When working with Spark you still have to look at your application and the problem you are trying to solve and think how to parallelize it, but the Spark APIs are intuitive to understand and to use for the typical application programmer. Spark gives you the tools to essentially access the same power a grid computing platform has or distributed database engine might have internally and makes it available to the average programming to embed that same sophistication in their own application.

Spark is a game changer. It can be used for everything from ETL to basic application OLTP computations that drive a GUI to backend batch processing to real-time streaming applications and graph modeling. Spark is truly a game changer that will bring some of the powerful technology pioneered by the internet giants for leveraging distributed computing into applications at levels of the enterprise. Strap your boots and starting learning Spark. It is the next evolution in not just Big Data but in general purpose application programming that can leverage true distributed grid computing and bring it to the programming masses.

Monday, July 27, 2015

Unbundling Database Architecture: Turning Databases Inside-Out

Relational database technology has been around for a few decades now. In the last several years we have seen a resurgence of innovation around data storage and data processing. This has pushed us into the realm of thinking outside of traditional SQL and big iron monolithic computing.

NoSQL, NewSQL and distributed commodity/cloud storage is changing how we build persistence into our applications. However the fundamentals of databases have not changed much. Lower cost memory and the availability of cheaper cloud computing has created a lot of innovation, but how databases function under the hood has not changed very much.

The fundamentals of how transaction atomicity, replication and considerations such as CAP theorem are still tackled in much the same way as they were with the earlier database engines. But is there a different way to look at how applications manage persistence for OLTP type of transactions? Well, Apache Samza presents an interesting approach to how data is managed. While it takes things from a streaming centric approach, this could present a new way for how applications can manage general data storage in the future.

Here is an interesting blog that presents a breakdown how the Apache Samza architecture and how this can facilitate more general purpose application data management by using an "unbundled" architecture in the heart of the database engine. Is this just another specialized data storage engine geared toward steaming data and analytics, or a whole new way to think about database architecture?

Sunday, June 7, 2015

Isomorphic Web Apps: Back to the Future, Again

As web application development evolves, we continue to see the pendulum swing between client and server. Over the past two decades we have moved from simple multi-page HTML applications that are rendered exclusively on the server to ultra fat single page applications (SPA) containing more javascript than anyone would have imagined a few years ago.

Over the past couple of years, many large hosted sites (i.e. Airbnb, Facebook and others) have run into challenges with building heavy javascript client apps and have rediscovered the value of rendering some of the web content on the server. Technology such as Node.js has made this easier and so has the creation of frameworks such as ReachJS. This rediscovering of using the server for rendering UI now has a new cool name, Isomorphic Javascript. The name seams to have stuck, so we will need to add it our lexicon :)

The technology around this new approach is gaining some steam of late. Here is a good blog from from Airbnb on what led them to consider this architecture for their hosted web application services. While the idea for moving away from SPA has been around for while, it is gaining more steam of late and we will for sure start to see more of the established front-end JavaScript frameworks incorporating it in one way or another as well as new frameworks such as ReachJS.

ReactJS is one of the more popular frameworks that leverage server side rendering and that advocates for this hybrid web application development. While Node.js is the leading container for supporting this application delivery model, we will start to see JVM support and integration as well with Java 8 Nashorn.

There are many benefits to building your web application with an isomorphic javascript architecture that I will try to cover in an up coming blog. There are already some good blogs covering the subject. Also expect AngularJS 2.0 to offer support for server side rendering, but we will have to wait and see what Google comes up with as AngularJS 2.0 gets further along.

So keep an eye out for this new twist in web application development. It will will be a boost for mobile development as well since mobile can certainly benefit from some server-side offloading of processing. But like most things, this new technology approach is no free lunch. Isomorphic javascript does add some complexity to constructing your web applications. Some of this maybe alleviated as web application frameworks evolve and as HTML web component standard mature. Stay tuned.