Wednesday, May 25, 2016

Building ML Pipelines

What is involved in building a machine learning pipeline? Here is a common flow:

  • Data pre-processing
  • Feature extraction
  • Model fitting
  • Validation stages
Learn more about ML pipelines (from a Spark perspective).
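
The four stages above can be sketched in plain Java. This is only a toy illustration of the flow, not Spark's pipeline API: the class name PipelineSketch, the word-count feature and the mean-threshold "model" are all assumptions made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Stage 1: data pre-processing (trim and lower-case the raw text).
    static List<String> preprocess(List<String> raw) {
        return raw.stream()
                .map(s -> s.trim().toLowerCase())
                .collect(Collectors.toList());
    }

    // Stage 2: feature extraction (one toy feature: the word count).
    static List<Integer> extractFeatures(List<String> docs) {
        return docs.stream()
                .map(d -> d.isEmpty() ? 0 : d.split("\\s+").length)
                .collect(Collectors.toList());
    }

    // Stage 3: model fitting (the toy "model" is the mean word count,
    // later used as a threshold for calling a document "long").
    static double fit(List<Integer> features) {
        return features.stream().mapToInt(Integer::intValue).average().orElse(0.0);
    }

    // Stages 1 to 3 chained together, the way a pipeline chains its stages.
    public static double train(List<String> rawDocs) {
        Function<List<String>, List<String>> pre = PipelineSketch::preprocess;
        Function<List<String>, List<Integer>> featurize =
                pre.andThen(PipelineSketch::extractFeatures);
        return fit(featurize.apply(rawDocs));
    }

    // Stage 4: validation (accuracy of the threshold on held-out labeled docs).
    public static double validate(double threshold, List<String> rawDocs,
                                  List<Boolean> isLong) {
        List<Integer> features = extractFeatures(preprocess(rawDocs));
        int correct = 0;
        for (int i = 0; i < features.size(); i++) {
            boolean predicted = features.get(i) > threshold;
            if (predicted == isLong.get(i)) {
                correct++;
            }
        }
        return (double) correct / features.size();
    }

    public static void main(String[] args) {
        double threshold = train(Arrays.asList(
                "  Spark is fast ", "Hello", "Pipelines chain many stages together"));
        double accuracy = validate(threshold,
                Arrays.asList("one two", "a b c d e f"),
                Arrays.asList(false, true));
        System.out.println("threshold=" + threshold + " accuracy=" + accuracy);
    }
}
```

The point of the structure is that each stage only depends on the output of the previous one, so stages can be swapped or tuned independently.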

Tuesday, May 24, 2016

Hyperparameter Tuning with Apache Spark and TensorFlow

Good blog on how TensorFlow can be leveraged with Apache Spark to parallelize the tuning of Deep Learning models.

Databricks Blog:

Spark Structured Streaming - Crazy Like a Fox

Big Data related computing has matured greatly over the past several years from its early and humble Map-Reduce days. Hadoop introduced developers and enterprises to mainstream distributed computing on commodity hardware and with a software stack (largely Java based) that was accessible to the average developer.

Early versions of Hadoop did not have the most developer friendly APIs, but they made breaking up large computing tasks and iterative processing possible to scale without big iron and SMP hardware. Things evolved and improved with the emergence of memory efficient Big Data engines such as Apache Spark. This has also been helped by the fact that memory prices keep dropping.

A lot of attention has been given to Apache Spark these days as the successor to Hadoop. The big advantage that Spark is touted to have over Hadoop is how its Map-Reduce engine leverages distributed memory to improve performance over classic Hadoop. While this is true, the broader Hadoop ecosystem has been evolving rapidly as well, so this alone is not at the heart of what has given Apache Spark such a big leap forward.

What is often underestimated in the growing popularity of Spark is its API. If you have ever tried to write a Map-Reduce type job in Java on Hadoop 1.x or 2.x, you will understand. Spark's API is multilingual, with support for Scala, Java, Python and R. The way you build data processing pipelines and construct transformations and aggregations in Spark has been well thought out by its authors.

Spark is not standing still either. With the development of Spark Streaming, Spark SQL, DataFrames and Datasets in the Spark API, Spark is making the work of manipulating data and writing processing logic much more intuitive for developers. The elegance of the Spark API is a key part of the reason why Spark has grown in popularity.

One knock on Spark is that it is now being obsoleted by the next wave of compute fabric engines that are built from the ground up to be realtime streaming centric. Many claim that this streaming-first architecture is superior to Spark's batch-based architecture for general-purpose processing and especially for streaming operations. Products such as Storm, Flink and Apex, just to mention a few, have garnered a lot of attention. The claim is that by using a streaming-first architecture, these engines can do both batch processing and streaming more efficiently than Spark does batch and micro-batch based streaming.

What is often left out of such debates is, again, the API. If you have ever tried to write a Storm processing stream you will know what I mean. So here again is where Spark shines with its more intuitive APIs.

Now this is where we get to Spark's new API coming out in the soon-to-be-released Apache Spark 2.0. Spark will be introducing a new Structured Streaming API that will unify streaming, batch and Spark SQL. The Spark team is raising the bar for developer productivity by unifying how both batch and streaming applications are built.

The idea is that a streaming application is really a "continuous application" and that the best way to build a streaming application is not to have to reason about streaming at all. In other words, Spark 2.0 with Structured Streaming will make building a streaming application no different from building any other Spark application. The streaming aspect is essentially declarative, and the Spark engine will do the work of optimizing the stream. The big advantage for developers is that we can continue to think of our applications in the same way whether they are doing streaming or batch.
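
To illustrate the "write the logic once" idea, here is a toy sketch in plain Java. It is not the Structured Streaming API; the class ContinuousQuerySketch and its methods are hypothetical names invented for the example. The application logic lives in a single update step, and the "engine" decides whether to run it over a complete data set or over chunks as they arrive.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContinuousQuerySketch {
    // The application logic, written once: count events per key.
    public static void update(Map<String, Long> counts, String event) {
        counts.merge(event, 1L, Long::sum);
    }

    // Batch execution: all of the data is available up front.
    public static Map<String, Long> runBatch(List<String> events) {
        Map<String, Long> counts = new HashMap<>();
        for (String event : events) {
            update(counts, event);
        }
        return counts;
    }

    // Streaming execution: data arrives in chunks, and the "engine"
    // (this loop) carries the running state from one chunk to the next.
    public static Map<String, Long> runStreaming(List<List<String>> chunks) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> chunk : chunks) {
            for (String event : chunk) {
                update(counts, event);
            }
            // A real engine would publish the updated result at this point.
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = runBatch(Arrays.asList("click", "view", "click"));
        Map<String, Long> streamed = runStreaming(Arrays.asList(
                Arrays.asList("click"), Arrays.asList("view", "click")));
        System.out.println(batch.equals(streamed));
    }
}
```

Both runs produce the same counts, and the developer only ever wrote the update step. How the data is chunked and when results are emitted stays inside the engine.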

Spark 2.0, with the advent of Structured Streaming, will leapfrog Spark ahead of the competing streaming-first engines by removing the stream design complexity while at the same time bringing the elegance of Spark's APIs to the forefront.

At the end of the day, Spark's well-designed APIs will prove to be pivotal for developers. Those APIs and Spark's fast-evolving optimized engine (Tungsten, etc.) will offer a hard-to-beat combination of developer productivity and raw scalable performance. A programming model that does not require developers to reason about a stream, and instead lets them focus on the higher-order functions of their application, will in the end prove superior to the harder-to-use streaming-first engines such as Storm and the like. This unified programming model also frees the Spark engine to evolve the low-level streaming plumbing over time without impacting developers.

Wednesday, May 18, 2016

Fluent Interfaces a Beautiful Thing

Fluent programming interfaces, when done right, are an elegant thing to behold (for a programmer). They require no specialized learning versus what it would take to build and model the same sort of domain logic in an external DSL. While specialized DSLs have their place, they create a challenging ecosystem to support and impose the need for additional moving parts outside the core development of the application and system. When dedicated long-term resources are applied to supporting a DSL, there is no doubt external DSLs can be a powerful thing. But in the absence of this, fluent interfaces are a powerful software programming pattern.

Here is a good video presentation describing the pros and cons of fluent interfaces vs using external DSLs. The presentation provides a pragmatic perspective from a point of personal experience in the industry.

Like anything, fluent interfaces can be abused, but when used with good intentions they can make software easier to build, read and maintain. What are good examples of fluent interfaces? There are many, and I have noticed more frameworks and APIs supporting them. Cassandra's Java driver is one example (QueryBuilder), and frameworks like Apache Spark and other general map/reduce data flow processing APIs make great use of fluent interfaces.

Here is a snippet of code I borrowed from Martin Fowler's post on the subject that gives a before and after example of using a fluent API:

private void makeNormal(Customer customer) {
        Order o1 = new Order();
        customer.addOrder(o1);
        OrderLine line1 = new OrderLine(6, Product.find("TAL"));
        o1.addLine(line1);
        OrderLine line2 = new OrderLine(5, Product.find("HPK"));
        o1.addLine(line2);
        OrderLine line3 = new OrderLine(3, Product.find("LGV"));
        o1.addLine(line3);
        line2.setSkippable(true);
        o1.setRush(true);
}

private void makeFluent(Customer customer) {
        customer.newOrder()
                .with(6, "TAL")
                .with(5, "HPK").skippable()
                .with(3, "LGV")
                .priorityRush();
}
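
For readers wondering what sits behind such a call chain, here is a minimal self-contained sketch of one way the fluent side could be implemented. It is my own illustration, not Fowler's implementation: the class FluentOrderSketch, the describe() helper and the string product codes are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class FluentOrderSketch {
    // One order line; 'skippable' can be switched on after the line is added.
    public static final class Line {
        final int quantity;
        final String productCode;
        boolean skippable;

        Line(int quantity, String productCode) {
            this.quantity = quantity;
            this.productCode = productCode;
        }
    }

    public static final class Order {
        private final List<Line> lines = new ArrayList<>();
        private boolean rush;

        // Every builder method returns 'this', which is what allows chaining.
        public Order with(int quantity, String productCode) {
            lines.add(new Line(quantity, productCode));
            return this;
        }

        // Applies to the most recently added line, mirroring how the
        // call reads in the chain: .with(5, "HPK").skippable()
        public Order skippable() {
            if (lines.isEmpty()) {
                throw new IllegalStateException("no order line to mark as skippable");
            }
            lines.get(lines.size() - 1).skippable = true;
            return this;
        }

        public Order priorityRush() {
            rush = true;
            return this;
        }

        // Hypothetical helper so the result of the chain can be inspected.
        public String describe() {
            StringBuilder sb = new StringBuilder(rush ? "RUSH" : "NORMAL");
            for (Line line : lines) {
                sb.append(' ').append(line.quantity).append('x').append(line.productCode);
                if (line.skippable) {
                    sb.append("(skippable)");
                }
            }
            return sb.toString();
        }
    }

    public static Order newOrder() {
        return new Order();
    }

    public static void main(String[] args) {
        Order order = newOrder()
                .with(6, "TAL")
                .with(5, "HPK").skippable()
                .with(3, "LGV")
                .priorityRush();
        System.out.println(order.describe());
    }
}
```

The design choice that matters is that the chain reads like a sentence about the domain, while the bookkeeping (which line skippable() refers to) is hidden inside the builder.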

So, while fluent interfaces don't give you the power of a full-fledged external DSL, they can be a productivity boost for any API you are building. Give fluent interfaces a look in your next framework; they can make your code easier to build and maintain.

Saturday, April 23, 2016

Visualizing the Data Science Disciplines

Nice visualization showing how the various data science disciplines interrelate. Puts some of the hype around artificial intelligence, predictive analytics and big data in some perspective.

Thursday, April 14, 2016

Understanding Supervised vs Unsupervised Machine Learning

I always found it a bit difficult to explain how labeled and non-labeled data sets factor into machine learning algorithms and the related training/modeling process. This short explanation I found on Stack Overflow helped crystallize it for me:

I have always found the distinction between unsupervised and supervised learning to be arbitrary and a little confusing. There is no real distinction between the two cases, instead there is a range of situations in which an algorithm can have more or less 'supervision'. The existence of semi-supervised learning is an obvious example where the line is blurred.

I tend to think of supervision as giving feedback to the algorithm about what solutions should be preferred. For a traditional supervised setting, such as spam detection, you tell the algorithm "don't make any mistakes on the training set"; for a traditional unsupervised setting, such as clustering, you tell the algorithm "points that are close to each other should be in the same cluster". It just so happens that the first form of feedback is a lot more specific than the latter.

In short, when someone says 'supervised', think classification, when they say 'unsupervised' think clustering and try not to worry too much about it beyond that.
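
The two kinds of feedback can be put side by side in a small Java sketch. This is only a toy on one-dimensional points, assuming exactly two groups. The class SupervisionSketch, the nearest-centroid classifier and the simple 2-means loop are my own choices for illustration and are not part of the quoted answer.

```java
import java.util.Arrays;

public class SupervisionSketch {
    // Supervised: the labels (0 or 1) tell us which class each training
    // point belongs to, so each class centroid is computed directly.
    public static double[] fitWithLabels(double[] xs, int[] labels) {
        double[] sum = new double[2];
        int[] count = new int[2];
        for (int i = 0; i < xs.length; i++) {
            sum[labels[i]] += xs[i];
            count[labels[i]]++;
        }
        return new double[] { sum[0] / count[0], sum[1] / count[1] };
    }

    // Unsupervised: no labels. The only feedback is "nearby points belong
    // together", applied here as a basic 2-means loop.
    public static double[] fitWithoutLabels(double[] xs, int iterations) {
        double lo = Arrays.stream(xs).min().getAsDouble();
        double hi = Arrays.stream(xs).max().getAsDouble();
        double[] centroids = { lo, hi }; // start from the two extremes
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[2];
            int[] count = new int[2];
            for (double x : xs) {
                int k = nearest(centroids, x);
                sum[k] += x;
                count[k]++;
            }
            for (int k = 0; k < 2; k++) {
                if (count[k] > 0) {
                    centroids[k] = sum[k] / count[k];
                }
            }
        }
        return centroids;
    }

    // Shared by both: assign a point to the closer centroid.
    public static int nearest(double[] centroids, double x) {
        return Math.abs(x - centroids[0]) <= Math.abs(x - centroids[1]) ? 0 : 1;
    }

    public static void main(String[] args) {
        double[] xs = { 1.0, 2.0, 3.0, 9.0, 10.0, 11.0 };
        int[] labels = { 0, 0, 0, 1, 1, 1 };
        System.out.println(Arrays.toString(fitWithLabels(xs, labels)));
        System.out.println(Arrays.toString(fitWithoutLabels(xs, 5)));
    }
}
```

On well-separated data like this, both approaches land on the same two centroids. The difference is that in the supervised case the groups carry a meaning given by the labels, while in the unsupervised case the cluster numbering is arbitrary.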

Hope you find it useful :)

Wednesday, April 6, 2016

The Industrial IoT and the Rise of Cloud Machine Learning

The Internet of Things (IoT) has been driven by advancements in many areas of technology along with the ever expanding reach of the internet. This has made it feasible today for any device big or small to be connected to the world.

Alone, having billions of devices sharing information is more or less noise. The IoT is of little value unless businesses and industries can turn raw contextual data into valuable and actionable information. We are at a turning point across all industries where the volumes of data being generated have the potential to be turned into vital business information.

The Killer App
The IoT is founded on two principles. First and foremost is the ability to efficiently collect and catalog the vast sets of data available from sensors and other internal digital systems, and to handle this in a timely manner. Second is the creation of machine learning models and the related analytics that can drive predictive and prescriptive decision making for the owners of these devices and industries. With the ability to connect any device big or small, the opportunities for data gathering and intelligent decision making based on large volumes of timely data will propel many industrial killer applications that turn the data into value that can be used to optimize business functions and operations. The opportunity for an industry or business to create its own IoT killer apps is only at its beginning, and there are countless opportunities across all markets. The efficiencies created by using data gathered from every corner of your business will drive huge opportunities for business optimization.

Much of the initial interest in the IoT started around consumer and retail scenarios, such as tracking and monitoring consumers or optimizing product movement in supply chains. While this is all good, we are now moving beyond the consumer aspects. The IoT is now invading the world of the Industrial Internet, where the potential business efficiencies dwarf the opportunities in the consumer and retail operations space and can ultimately offer profound benefits to human advancement.

Get Your ETL Groove
Predicting failures and optimizing maintenance/operations have tremendous value across all industries from healthcare to aviation, but this is the end result of an overall process. Some of the less glamorous aspects, which ultimately enable machine learning and predictive analytics, lie in the challenges of collecting the wealth of data on which the downstream machine learning and analytics are based. Whether you are a wind turbine power plant or a railway operator, collecting data from field operations is no small task. You cannot get to useful machine learning models without gathering the data needed to train and feed your models. This is often an obstacle for industries not accustomed to collecting data using a Big Data mindset (velocity/volume/variety and context), but it is a critical first step to be conquered.

Leveraging the Cloud
Here is where cloud services and machine learning PaaS solutions can help move industries from design to live deployments. Moving data into the cloud for cleansing and ETL processing is the first step to prepare your data sets for consumption by your data wranglers and data scientists. The good news is there are many startups popping up helping with this end to end process. Machine learning services are a new market for startups, and it is definitely worth looking into leveraging such services if you don't have the capacity to build the data wrangling, machine learning and predictive analytics yourself.

Leveraging the cloud is a great option for many businesses, and there are already many options to choose from. You can build your own from the ground up on an IaaS cloud environment or leverage the growing list of small and big PaaS cloud service providers coming on line. One interesting trend is the focus on domain-specific machine learning and analytics for the IoT. Companies such as Predikto, for example, focus on predictive analytical services for targeted vertical industries (rail and aviation in this case). I think we will see an increase in startups focusing on abstracting away the technology complexity and plumbing and offering end users more end-to-end services geared toward a particular market and industry. This focus on vertical domains also aligns well with how machine learning models need to be tuned and optimized and how each vertical industry tailors its own predictive and prescriptive decision making.

Take Some Action
The first phase of the IoT has been about tracking and monitoring the many connected devices. As we move past it, in the next phase industries will want to make actionable sense out of their data in ways that can improve business efficiencies and replace slow, reactive human decision making with real-time decision making based on machine learning models. The ultimate goal is to reach a point where the decision making is prescriptive and even potentially AI powered, but we are a long way from that. For many businesses, it is about wrangling their data and enabling the "Things" they are tapping into to feed data to the cloud in order to build and drive their machine learning models and downstream analytics.

The comparison I like to make is that the IoT is transforming business, across all industries, in much the same way that digital trading platforms and high frequency trading systems transformed financial marketplaces. The financial trading platforms of today collect and monitor vast amounts of data points, and everyone is looking for the most timely and actionable information in order to beat the next guy. This is what is happening with the IoT, in large part. It is bringing to the surface vast amounts of data from every corner of a business and making it actionable. However, it will take time for the many industries, from farming to manufacturing, to get their machine learning bearings. Again, don't look to build all of it yourself. There are many cloud and big data resources and even full-service startups specializing in your industry that can help.

Updating Machine Learning Models
The process of making predictions and anticipating the future is based on building accurate models of your industrial world. These models are often not static. Models change over time as the many variables that impact your business change and as your business itself grows and evolves. So a common consideration in machine learning is how to keep evolving your models. A good post on this subject can be found here. This article describes how time-series prediction (where historical data is vital to the model, as in the case of weather modeling) and feedback data (data that changes as a business grows, such as when a retail store starts to sell new products) impact machine learning models and can trigger models to be retrained from scratch. Understanding how your models evolve is an important aspect of machine learning, because without accurate models you have bad information. You are only as good as your models, and keeping them up to date is a constant effort by your data wranglers and data scientists.

It is also important to appreciate that observing your business can change it as well. So you must always be looking at retraining your models as you use predictions that optimize your business. The process of optimizing your business (for the better hopefully) requires changes and updates to your machine learning models on a regular basis.
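
As a rough sketch of what "retrain when the world changes" can look like in code, here is a toy Java example. Everything in it is an assumption for illustration: the "model" is just the mean of a recent window of observations, the class RetrainSketch is a made-up name, and the retraining trigger is a simple error tolerance rather than anything taken from the linked post. It also assumes a non-empty initial training set.

```java
import java.util.ArrayList;
import java.util.List;

public class RetrainSketch {
    private final List<Double> history = new ArrayList<>();
    private final int window;       // how many recent observations to train on
    private final double tolerance; // largest prediction error we accept
    private double prediction;      // the toy "model": a single predicted value

    public RetrainSketch(List<Double> initial, int window, double tolerance) {
        this.history.addAll(initial);
        this.window = window;
        this.tolerance = tolerance;
        retrain();
    }

    // Rebuild the model from scratch using only the most recent window.
    private void retrain() {
        int from = Math.max(0, history.size() - window);
        double sum = 0.0;
        for (double value : history.subList(from, history.size())) {
            sum += value;
        }
        prediction = sum / (history.size() - from);
    }

    // Feed in an observed value; returns true if it triggered a retrain.
    public boolean observe(double actual) {
        history.add(actual);
        boolean drifted = Math.abs(actual - prediction) > tolerance;
        if (drifted) {
            retrain();
        }
        return drifted;
    }

    public double predict() {
        return prediction;
    }
}
```

A usage example: start with `new RetrainSketch(Arrays.asList(10.0, 10.0, 10.0), 3, 2.0)`, then call `observe(10.0)` (no retrain) followed by `observe(16.0)` (retrain, because the error exceeds the tolerance). Real systems would use a more careful drift test and a real model, but the loop of predict, compare with reality, and retrain is the same.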

IoT + ETL + ML Models + Cloud = Optimizing Business
There are many considerations when beginning your Industrial IoT journey. There are no short-cuts, and the effort requires investing in and developing new skills and leveraging new technologies, but the journey will profoundly change your business for the better.