Thursday, May 26, 2016

Building ML Pipelines


What is involved in building a machine learning pipeline? Here is a common flow:

  • Data pre-processing
  • Feature extraction
  • Model fitting
  • Validation stages
Learn more about ML pipelines (from a Spark perspective).
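
To make those stages concrete, here is a minimal sketch of a Spark ML Pipeline in Java (Spark 2.0-style Dataset&lt;Row&gt; types; the text-classification task, column names and parameter values are assumptions for illustration, not a prescribed recipe):

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical example: assumes "training" and "test" DataFrames with "text" and "label" columns.
private Dataset<Row> trainAndPredict(Dataset<Row> training, Dataset<Row> test) {
    // Pre-processing: split raw text into words
    Tokenizer tokenizer = new Tokenizer()
            .setInputCol("text")
            .setOutputCol("words");
    // Feature extraction: hash words into a feature vector
    HashingTF hashingTF = new HashingTF()
            .setInputCol(tokenizer.getOutputCol())
            .setOutputCol("features");
    // Model fitting: a simple logistic regression estimator
    LogisticRegression lr = new LogisticRegression()
            .setMaxIter(10)
            .setRegParam(0.01);

    // The pipeline chains the stages in order
    Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[] { tokenizer, hashingTF, lr });

    PipelineModel model = pipeline.fit(training);  // fit all stages on the training data
    return model.transform(test);                  // score the held-out data (validation)
}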


Tuesday, May 24, 2016

Spark Structured Streaming - Crazy Like a Fox

Big Data computing has matured greatly over the past several years since its early, humble MapReduce days. Hadoop introduced developers and enterprises to mainstream distributed computing on commodity hardware, with a software stack (largely Java based) that was accessible to the average developer.

Early versions of Hadoop did not have the most developer-friendly APIs, but they made it possible to scale large computing tasks and iterative processing without big iron and SMP hardware. Things evolved and improved with the emergence of memory-efficient Big Data engines such as Apache Spark, helped along by the fact that memory prices keep dropping.

A lot of attention has been given to Apache Spark these days as the successor to Hadoop. The big advantage Spark is touted to have over Hadoop is how its engine leverages distributed memory to improve performance over classic MapReduce. While this is true, the broader Hadoop ecosystem has been evolving rapidly as well, so this alone is not at the heart of what has given Apache Spark such a big leap forward.

What is often underestimated in the growing popularity of Spark is its API. If you have ever tried to write a MapReduce-style job in Java against Hadoop 1.x or 2.x, you will understand why. Spark is API plural, with support for Scala, Java, Python and R. The way you build data processing pipelines and compose transformations and aggregations in Spark is well thought out by its authors.
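
As a small illustration of that style, here is a rough sketch in the Java API, chaining transformations into an aggregation (the log format, file path and method name are hypothetical):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Count ERROR lines per component; assumes log lines shaped like "ERROR component message..."
private JavaPairRDD<String, Integer> errorsByComponent(JavaSparkContext sc, String path) {
    JavaRDD<String> lines = sc.textFile(path);
    return lines
            .filter(line -> line.startsWith("ERROR"))                // transformation
            .mapToPair(line -> new Tuple2<>(line.split(" ")[1], 1))  // key by component
            .reduceByKey((a, b) -> a + b);                           // aggregation
}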


Spark is not standing still either. With the addition of Spark Streaming, Spark SQL, DataFrames and Datasets to the API, Spark has made manipulating data and writing processing logic much more intuitive for developers. The elegance of the Spark API is a key part of the reason why Spark has grown in popularity.
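
For example, a DataFrame query composes in one fluent chain and reads almost like SQL while staying inside the host language. A rough sketch, assuming the orders.json input and its schema as made-up illustrations, with SparkSession as the Spark 2.0 entry point:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

private void summarizeOrders(SparkSession spark) {
    Dataset<Row> orders = spark.read().json("orders.json");  // hypothetical input
    Dataset<Row> summary = orders
            .filter(col("status").equalTo("shipped"))         // keep shipped orders
            .groupBy("region")                                 // group by region
            .agg(avg("total").alias("avg_total"));             // average order total
    summary.show();
}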

One knock on Spark is that it is now being obsoleted by the next wave of compute engines that are built from the ground up to be real-time and streaming centric. Many claim that this streaming-first architecture is superior to Spark's batch-based architecture for general-purpose processing and especially for streaming operations. Products such as Storm, Flink and Apex, just to mention a few, have garnered a lot of attention. The claim is that by using a streaming-first architecture, these engines can do both batch processing and streaming more efficiently than Spark does batch and micro-batch-based streaming.
What is often left out of such debates is, again, the API. If you have ever tried to write a Storm processing topology, you will know what I mean. Here again, Spark shines with its more intuitive APIs.

This brings us to Spark's new API coming in the soon-to-be-released Apache Spark 2.0. Spark will be introducing a new Structured Streaming API that unifies streaming, batch and Spark SQL. The Spark team is raising the productivity bar by unifying how developers build both batch and streaming applications.

The idea is that a streaming application is really a "continuous application" and that the best way to build a streaming application is to not reason about streaming at all. In other words, Spark 2.0 with Structured Streaming will make building a streaming application no different from building any other Spark application. The streaming aspect is essentially declarative, and the Spark engine does the work of optimizing the stream. The big advantage for developers is that we can continue to think of our applications in the same way whether they are doing streaming or batch.
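
Based on the Spark 2.0 preview, here is a sketch of what that looks like in Java (names may still shift before the final release; the socket source and console sink are purely illustrative choices):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredCounts {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("StructuredCounts")
                .getOrCreate();

        // Declare the stream as if it were a table that keeps growing
        Dataset<Row> lines = spark.readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load();

        // The same DataFrame operations you would write for a batch job
        Dataset<Row> counts = lines.groupBy("value").count();

        // The engine, not the developer, worries about incremental execution
        StreamingQuery query = counts.writeStream()
                .outputMode("complete")
                .format("console")
                .start();
        query.awaitTermination();
    }
}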

Spark 2.0, with the advent of Structured Streaming, will leapfrog Spark ahead of the competing streaming-first engines by removing stream design complexity while bringing Spark's elegant API design to the forefront.

At the end of the day, Spark's well-designed APIs will prove pivotal for developers. That productivity, combined with Spark's fast-evolving optimized engine (Tungsten, etc.), will be a hard-to-beat combination of developer productivity and raw scalable performance. A programming model that does not require developers to reason about a stream, and instead lets them focus on the higher-level functions of their application, will in the end prove superior to the harder-to-use streaming-first engines such as Storm and the like. This unified programming model also frees the Spark engine to evolve the low-level streaming plumbing over time without impacting developers.

Wednesday, May 18, 2016

Fluent Interfaces, a Beautiful Thing


Fluent programming interfaces, when done right, are an elegant thing to behold (for a programmer). They require no specialized learning versus what it would take to build and model the same sort of domain logic in an external DSL. While specialized DSLs have their place, they create a challenging ecosystem to support and impose the need for additional moving parts outside the core development of the application and system. When dedicated long-term resources are applied to supporting a DSL, there is no doubt external DSLs can be a powerful thing. But in the absence of this, fluent interfaces are a powerful software programming pattern.


Here is a good video presentation describing the pros and cons of fluent interfaces vs. external DSLs. The presentation provides a pragmatic perspective drawn from personal experience in the industry.

Like anything, fluent interfaces can be abused, but when used with good intentions they can produce software that is easier to build, read and maintain. What are good examples of fluent interfaces? There are many, and I have noticed more frameworks and APIs supporting them. Cassandra's Java driver is one example (QueryBuilder), and frameworks like Apache Spark and other general map/reduce data flow processing APIs make great use of fluent interfaces.
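
For instance, a query built with the driver's QueryBuilder reads as one flowing expression. A rough sketch, with the keyspace, table, columns and contact point all hypothetical:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.querybuilder.QueryBuilder;

// Assumes a "shop" keyspace with an "orders" table keyed by customer_id
Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("shop");

// The whole query is composed as a single fluent chain
Statement query = QueryBuilder.select("id", "total")
        .from("orders")
        .where(QueryBuilder.eq("customer_id", 42))
        .limit(10);

ResultSet rows = session.execute(query);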

Here is a snippet of code I borrowed from Martin Fowler's post on the subject, which gives a before-and-after example of using a fluent API:

private void makeNormal(Customer customer) {
    Order o1 = new Order();
    customer.addOrder(o1);
    OrderLine line1 = new OrderLine(6, Product.find("TAL"));
    o1.addLine(line1);
    OrderLine line2 = new OrderLine(5, Product.find("HPK"));
    o1.addLine(line2);
    OrderLine line3 = new OrderLine(3, Product.find("LGV"));
    o1.addLine(line3);
    line2.setSkippable(true);
    o1.setRush(true);
}

private void makeFluent(Customer customer) {
    customer.newOrder()
            .with(6, "TAL")
            .with(5, "HPK").skippable()
            .with(3, "LGV")
            .priorityRush();
}

So, while fluent interfaces don't give you the power of a full-fledged external DSL, they can be a productivity boost for any API you are building. Give fluent interfaces a look in your next framework; they can make your code easier to build, read and maintain.