Saturday, April 23, 2016

Visualizing the Data Science Disciplines




Nice visualization showing how the various data science disciplines interrelate. It puts some of the hype around artificial intelligence, predictive analytics and big data in perspective.

Thursday, April 14, 2016

Understanding Supervised vs Unsupervised Machine Learning


I always found it a bit difficult to explain how labeled and unlabeled data sets factor into machine learning algorithms and the related training/modeling process. This short explanation I found on Stack Overflow helped crystallize it for me:

I have always found the distinction between unsupervised and supervised learning to be arbitrary and a little confusing. There is no real distinction between the two cases; instead there is a range of situations in which an algorithm can have more or less 'supervision'. The existence of semi-supervised learning is an obvious example where the line is blurred.

I tend to think of supervision as giving feedback to the algorithm about what solutions should be preferred. For a traditional supervised setting, such as spam detection, you tell the algorithm "don't make any mistakes on the training set"; for a traditional unsupervised setting, such as clustering, you tell the algorithm "points that are close to each other should be in the same cluster". It just so happens that the first form of feedback is a lot more specific than the latter.

In short, when someone says 'supervised', think classification, when they say 'unsupervised' think clustering and try not to worry too much about it beyond that.
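To make the two kinds of feedback concrete, here is a minimal Python sketch. It is not from the quoted answer: the toy 2-D points, the "ham"/"spam" labels, and the function names are all made up for illustration. The supervised half is a simple nearest-centroid classifier that uses the labels; the unsupervised half is a bare-bones k-means that only knows "close points belong together".

```python
from math import dist
from statistics import mean


def centroid(points):
    """Mean position of a list of 2-D points."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))


def fit_supervised(points, labels):
    """Supervised: the labels say which points belong together."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {label: centroid(members) for label, members in groups.items()}


def predict(centroids, point):
    """Return the label whose centroid is nearest to the point."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))


def fit_unsupervised(points, k=2, iterations=10):
    """Unsupervised (k-means): no labels, only distances between points."""
    # Farthest-first start: begin with one point, then add the point
    # farthest from the centers chosen so far (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
            clusters[nearest].append(point)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


# Hypothetical toy data: two obvious groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

model = fit_supervised(points, labels)        # uses the labels
centers = fit_unsupervised(points, k=2)       # never sees the labels
print(predict(model, (1.1, 0.9)))
print(centers)
```

On this toy data both approaches find the same two groups, but only the supervised model can name them; the clustering just reports two centers.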

Hope you find it useful :)

Wednesday, April 6, 2016

The Industrial IoT and the Rise of Cloud Machine Learning


The Internet of Things (IoT) has been driven by advancements in many areas of technology along with the ever-expanding reach of the internet. This has made it feasible today for any device, big or small, to be connected to the world.

On its own, having billions of devices sharing information is more or less noise. The IoT is of little value unless businesses and industries can turn raw contextual data into valuable and actionable information. We are at a turning point across all industries where the volumes of data being generated have the potential to be turned into vital business information.

The Killer App
The IoT is founded on two principles. First and foremost is the ability to efficiently collect and catalog the vast sets of data available from sensors and other internal digital systems, and to handle that data in a timely manner. Second is the creation of machine learning models and related analytics that can drive predictive and prescriptive decision making for the owners of these devices and their industries. With any device, big or small, able to be connected, data gathering and intelligent decision making based on large volumes of timely data will propel many industrial killer applications that turn data into value and optimize business functions and operations. The opportunities for an industry or business to create its own IoT killer apps are just beginning - there are countless opportunities across all markets. Using data gathered from every corner of your business will drive huge gains in business optimization and efficiency.

Much of the initial interest in the IoT centered on consumer and retail scenarios, such as tracking and monitoring consumers or optimizing product movement in supply chains. While this is all good, we are now moving beyond the consumer aspects: the IoT is invading the world of the Industrial Internet, where the potential business efficiencies dwarf the opportunities in the consumer and retail operations space and can ultimately offer profound benefits to human advancement.

Get Your ETL Groove
Predicting failures and optimizing maintenance and operations have tremendous value across all industries, from healthcare to aviation, but this is the end result of an overall process. Some of the less glamorous aspects, which ultimately enable machine learning and predictive analytics, lie in the challenges of collecting the wealth of data on which the downstream machine learning and analytics are based. Whether you are a wind turbine power plant or a railway operator, collecting data from field operations is no small task. You cannot get to useful machine learning models without gathering the data needed to train and feed them. This is often an obstacle for industries not accustomed to collecting data with a Big Data mindset (velocity/volume/variety and context), but it is a critical first step to be conquered.

Leveraging the Cloud
Here is where cloud services and machine learning PaaS solutions can help move industries from design to live deployments. Moving data into the cloud for cleansing and ETL processing is the first step in preparing your data sets for consumption by your data wranglers and data scientists. The good news is that many startups are popping up to help with this end-to-end process. Machine learning services are a new market for startups, and it is definitely worth looking into such services if you don't have the capacity to build the data wrangling, machine learning and predictive analytics yourself.
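As a rough illustration of what the cleansing and ETL step involves, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for the example: the device names, the CSV layout, the Fahrenheit-to-Celsius conversion, and the in-memory list standing in for a cloud data store. A real pipeline would read from field devices and write to whatever storage your cloud provider offers.

```python
import csv
import io

# Hypothetical raw sensor feed: includes a duplicate row, a missing
# reading, and a malformed reading, as field data often does.
RAW_CSV = """device_id,timestamp,temp_f
turbine-1,2016-04-06T10:00:00,98.6
turbine-1,2016-04-06T10:00:00,98.6
turbine-2,2016-04-06T10:00:05,
turbine-2,2016-04-06T10:01:05,212.0
turbine-3,2016-04-06T10:02:00,not-a-number
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop duplicates and unparseable readings, convert units."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["device_id"], row["timestamp"])
        if key in seen:
            continue  # duplicate reading for the same device and time
        try:
            temp_f = float(row["temp_f"])
        except ValueError:
            continue  # missing or malformed reading
        seen.add(key)
        cleaned.append({
            "device_id": row["device_id"],
            "timestamp": row["timestamp"],
            "temp_c": round((temp_f - 32) * 5 / 9, 2),
        })
    return cleaned


def load(rows, store):
    """Load: append cleaned rows to the target store (a list here)."""
    store.extend(rows)
    return len(rows)


warehouse = []  # stand-in for a cloud data store
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```

Of the five raw rows, only two survive cleansing, which is a reminder of how much of the effort goes into the data before any model sees it.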

Leveraging the cloud is a great option for many businesses, and there are already many options to choose from. You can build your own from the ground up on an IaaS cloud environment or leverage the growing list of small and big PaaS cloud service providers coming online. One interesting trend is the focus on domain-specific machine learning and analytics for the IoT. Companies such as Predikto, for example, focus on predictive analytical services for targeted vertical industries (rail and aviation in this case). I think we will see more startups focusing on abstracting away the technology complexity and plumbing and offering end users end-to-end services geared toward a particular market and industry. This focus on vertical domains also aligns well with how machine learning models need to be tuned and optimized and how each vertical industry tailors its own predictive and prescriptive decision making.

Take Some Action
As we move past the first phase of the IoT, which has been about tracking and monitoring the many connected devices, industries in the next phase will want to make actionable sense of their data in ways that improve business efficiencies and replace slow, reactive human decision making with real-time decision making based on machine learning models. The ultimate goal is to reach a point where the decision making is prescriptive and even potentially AI powered, but we are a long way from that. For many businesses, it is about wrangling their data and enabling the "Things" they are tapping into to feed data to the cloud in order to build and drive their machine learning models and downstream analytics.

The comparison I like to make is that the way the IoT is transforming business, across all industries, resembles what happened to financial marketplaces with the advent of digital trading platforms and high-frequency trading systems. The financial trading platforms of today collect and monitor vast numbers of data points, and everyone is looking for the most timely and actionable information in order to beat the next guy. This is what is happening with the IoT, in large part. It is bringing to the surface vast amounts of data from every corner of a business and making it actionable. However, it will take time for the many industries, from farming to manufacturing, to get their machine learning bearings. Again, don't look to build all of it yourself. There are many cloud and big data resources, and even full-service startups specializing in your industry, that can help.

Updating Machine Learning Models
The process of making predictions and anticipating the future is based on building accurate models of your industrial world. These models are often not static: they change over time as the many variables that impact your business change and as your business itself grows and evolves. So a common consideration in machine learning is how to keep evolving your models. A good post on this subject can be found here. The article describes how time-series prediction (where historical data is vital to the model, as in weather modeling) and feedback data (data that changes as a business grows, such as when a retail store starts to sell new products) impact machine learning models and can trigger models to be retrained from scratch. Understanding how your models evolve is an important aspect of machine learning, because without accurate models you have bad information. You are only as good as your models, and keeping them up to date is a constant effort by your data wranglers and data scientists.

It is also important to appreciate that observing your business can change it as well. So you must always be looking at retraining your models as you use predictions that optimize your business. The process of optimizing your business (for the better hopefully) requires changes and updates to your machine learning models on a regular basis.
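One simple way to decide when a model is due for retraining is to watch its recent prediction error and compare it with the error measured when the model was first trained. The Python sketch below is only an illustration under that assumption; the class name, the 1.5x tolerance, the window size, and the sample numbers are all hypothetical and are not taken from the post referenced above.

```python
from collections import deque


class RetrainMonitor:
    """Flags a model for retraining when its recent error drifts upward."""

    def __init__(self, baseline_error, tolerance=1.5, window=5):
        self.baseline_error = baseline_error  # error measured at training time
        self.tolerance = tolerance            # allowed multiple of the baseline
        self.errors = deque(maxlen=window)    # most recent absolute errors

    def record(self, predicted, actual):
        """Store the absolute error of one prediction."""
        self.errors.append(abs(predicted - actual))

    def rolling_error(self):
        """Mean absolute error over the current window."""
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def needs_retraining(self):
        """True once the window is full and the error exceeds the allowance."""
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.rolling_error() > self.tolerance * self.baseline_error


monitor = RetrainMonitor(baseline_error=1.0, tolerance=1.5, window=3)

# The model still tracks reality: small errors.
for predicted, actual in [(10.0, 10.5), (12.0, 11.0), (9.0, 9.5)]:
    monitor.record(predicted, actual)
stable = monitor.needs_retraining()

# The business has changed: predictions are now consistently off.
for predicted, actual in [(10.0, 14.0), (11.0, 15.0), (12.0, 17.0)]:
    monitor.record(predicted, actual)
drifted = monitor.needs_retraining()

print(stable, drifted)
```

A check like this only says that a model has gone stale, not why; deciding whether to retrain incrementally or from scratch is still a judgment call for your data scientists.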

IoT + ETL + ML Models + Cloud = Optimizing Business
There are many considerations when beginning your Industrial IoT journey. There are no shortcuts, and the effort requires investing in and developing new skills and leveraging new technologies, but the journey will profoundly change your business for the better.

Friday, April 1, 2016

Law of Parsimony Strikes Back

Let me first start off by saying (hate it when people start off saying this - usually means some principled BS is coming) that many new Big Data technologies, such as the concept of Map/Reduce, Machine Learning and products such as Hadoop, Spark and NoSQL databases, are great tools to have in your IT arsenal. Also, don't forget other infrastructure technologies such as hardware virtualization, software containers and other micro-services deployment architectures that are making IT environments more flexible and more manageable (note, this does not mean simpler). There is no doubt these technologies fit a number of problem domains that in the past were very hard to handle with standard computing stacks, IT tooling, and relational database technology.

Now, having said that, let's be careful not to over-apply them and end up with a system that is fragile and takes an army (or perhaps a small army) of super smart operations people to deploy and run. I always go back to one of my favorite principles, "the law of parsimony", sometimes referred to as Occam's razor. It boils down to the reality that nature has a habit of looking for the simplest path to solve a problem. We can't always see that simple elegance, though. Take for example nature's design of the common leafy tree. While it has a complex structure, this structure comes from some very simple principles.

I feel we are at a point with technology where we are advancing at a great pace, but in the process we are creating a lot of complexity. As Occam's razor states, complexity is a relative consideration among the alternatives, so the bar is always moving up on what we consider to be complex. Still, we sometimes need to step back and not use a bulldozer when a shovel will do the job just as well and will not break down on us when we need it most.


I feel that way with a lot of technology I see being applied. There are so many options to choose from that sometimes the simplest option for the problem is overlooked. This might be human nature, "he who dies with the most toys wins", but this can be a costly mistake for many businesses.

I often joke that if you give me a bunch of plain old app servers (pick your favorite) and a relational database (pick your favorite), I can move the world (might need a load balancer in there somewhere ;). So, when you are in your next architecture design meeting, ask yourself this question: can I bend my tools to my will, or do I need new toys to play with? :)