Nice visualization showing how the various data science disciplines interrelate. It puts some of the hype around artificial intelligence, predictive analytics and big data in perspective.
Saturday, April 23, 2016
Thursday, April 14, 2016
I always found it a bit difficult to explain how labeled and unlabeled data sets factor into machine learning algorithms and the related training/modeling process. This short explanation I found on Stack Overflow helped crystallize it for me:
I have always found the distinction between unsupervised and supervised learning to be arbitrary and a little confusing. There is no real distinction between the two cases; instead there is a range of situations in which an algorithm can have more or less 'supervision'. The existence of semi-supervised learning is an obvious example where the line is blurred.
I tend to think of supervision as giving feedback to the algorithm about what solutions should be preferred. For a traditional supervised setting, such as spam detection, you tell the algorithm "don't make any mistakes on the training set"; for a traditional unsupervised setting, such as clustering, you tell the algorithm "points that are close to each other should be in the same cluster". It just so happens that the first form of feedback is a lot more specific than the latter.
In short, when someone says 'supervised', think classification, when they say 'unsupervised' think clustering and try not to worry too much about it beyond that.
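To make the two kinds of feedback concrete, here is a minimal sketch in plain Python. The data points, the "ham"/"spam" labels and the function names are all made up for illustration. The supervised half is a simple nearest-centroid classifier that learns from labels; the unsupervised half is a bare-bones k-means that only knows "close points belong together". Neither is meant as a production algorithm.

```python
from statistics import mean


def centroid(points):
    """Mean position of a list of 2-D points."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def fit_supervised(points, labels):
    """Supervised: the labels tell us which points belong together."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {label: centroid(members) for label, members in groups.items()}


def predict(centroids, point):
    """Classify a point by the nearest labeled centroid."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], point))


def fit_unsupervised(points, centers, iterations=10):
    """Unsupervised (k-means): only closeness guides the grouping."""
    centers = list(centers)
    k = len(centers)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: sq_dist(centers[i], point))
            clusters[nearest].append(point)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [min(range(k), key=lambda i: sq_dist(centers[i], p)) for p in points]


# Hypothetical data: two obvious groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

model = fit_supervised(points, labels)
prediction = predict(model, (5.1, 5.0))

# No labels here: we only provide starting centers and let distance decide.
assignments = fit_unsupervised(points, centers=[points[0], points[-1]])
print(prediction, assignments)
```

Note that the clustering returns anonymous group numbers rather than names like "spam": without labels the algorithm can find the groups but cannot tell you what they mean.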
Hope you find it useful :)
Posted by Sam Taha at 8:33 AM
Wednesday, April 6, 2016
The Internet of Things (IoT) has been driven by advancements in many areas of technology along with the ever expanding reach of the internet. This has made it feasible today for any device big or small to be connected to the world.
Alone, having billions of devices sharing information is more or less noise. The IoT is of little value unless businesses and industries can turn raw contextual data into valuable and actionable information. We are at a turning point across all industries where the volumes of data being generated have the potential to be turned into vital business information.
The Killer App
The IoT is founded on two principles. First and foremost is the ability to efficiently collect and catalog the vast sets of data available from sensors and other internal digital systems, and to handle this data in a timely manner. Second, it is about creating machine learning models and the related analytics that can drive predictive and prescriptive decision making for the owners of these devices and industries. With the ability to connect any device big or small, data gathering and intelligent decision making based on large volumes of timely data will propel many industrial killer applications that turn the data into value and optimize business functions and operations. The opportunities for an industry or business to create its own IoT killer apps are just beginning, and they exist across all markets. The efficiencies created by using data gathered from every corner of your business will drive huge gains in business optimization.
Get Your ETL Groove
Predicting failures and optimizing maintenance/operations have tremendous value across all industries from healthcare to aviation, but this is the end result of an overall process. Some of the less glamorous aspects, which ultimately enable machine learning and predictive analytics, lie in the challenges of collecting the wealth of data on which the downstream machine learning and analytics are based. Whether you are a wind turbine power plant or a railway operator, collecting data from field operations is no small task. You cannot get to useful machine learning models without gathering the data needed to train and feed your models. This is often an obstacle for industries not accustomed to collecting data using a Big Data mindset (velocity/volume/variety and context), but it is a critical first step to be conquered.
Leveraging the Cloud
Here is where cloud services and machine learning PaaS solutions can help move industries from design to live deployments. Moving data into the cloud for cleansing and ETL processing is the first step to prepare your data sets for consumption by your data wranglers and data scientists. The good news is there are many startups popping up helping with this end to end process. Machine learning services are a new market for startups, and it is definitely worth looking into leveraging such services if you don't have the capacity to build the data wrangling, machine learning and predictive analytics yourself.
Leveraging the cloud is a great option for many businesses, and there are already many options to choose from. You can build your own from the ground up on an IaaS cloud environment or leverage the growing list of small and big PaaS cloud service providers coming online. One interesting trend is the focus on domain-specific machine learning and analytics for the IoT. Companies such as Predikto, for example, focus on predictive analytical services for targeted vertical industries (rail and aviation in this case). I think we will see an increase in startups focusing on abstracting away the technology complexity and plumbing and offering end users end-to-end services geared toward a particular market and industry. This focus on vertical domains also aligns well with how machine learning models need to be tuned and optimized and how each vertical industry tailors its own predictive and prescriptive decision making.
Take Some Action
As we move past the first phase of the IoT where it has been about tracking and monitoring the many connected devices, in the next phase industries will be wanting to make actionable sense out of their data in ways that can improve business efficiencies and replace slow reactive human decision making with real-time decision making based on machine learning models. The ultimate goal is to reach a point where the decision making is prescriptive and even potentially AI powered, but we are a long way from that. For many businesses, it is about wrangling their data and enabling the "Things" they are tapping into to be able to feed to the cloud in order to build and drive their machine learning models and downstream analytics.
Updating Machine Learning Models
The process of making predictions and anticipating the future is based on building accurate models of your industrial world. These models are often not static. Models change over time as the many variables that impact your business change and as your business itself grows and evolves. So a common consideration in machine learning is how to keep evolving your models. A good post on this subject can be found here. The article describes how time-series prediction (where historical data is vital to the model, as in weather modeling) and feedback data (data that changes as a business grows, such as a retail store starting to sell new products) impact machine learning models and can trigger models to be retrained from scratch. Understanding how your models evolve is an important aspect of machine learning, because without accurate models you have bad information. You are only as good as your models, and keeping them up to date is a constant effort by your data wranglers and data scientists.
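One simple way to picture "retrain when the world changes" is to watch the recent prediction error and rebuild the model when it drifts too far. The sketch below is only an illustration under my own assumptions: `MeanModel` is a toy stand-in for a real model, and the window size, threshold and data stream are invented.

```python
from collections import deque
from statistics import mean


class MeanModel:
    """Toy stand-in for a real model: always predicts the mean of its training data."""

    def __init__(self, history):
        self.value = mean(history)

    def predict(self):
        return self.value


def run_with_retraining(stream, window=5, threshold=2.0):
    """Replay a stream of observations and retrain from scratch whenever the
    average absolute error over the last `window` observations exceeds `threshold`."""
    recent = deque(stream[:window], maxlen=window)  # most recent observations
    model = MeanModel(recent)
    errors = deque(maxlen=window)                   # most recent prediction errors
    retrained_at = []
    for step, actual in enumerate(stream[window:], start=window):
        errors.append(abs(model.predict() - actual))
        recent.append(actual)
        if len(errors) == window and mean(errors) > threshold:
            model = MeanModel(recent)  # rebuild using only recent data
            errors.clear()
            retrained_at.append(step)
    return model, retrained_at


# Hypothetical sensor readings: the underlying level shifts from about 10 to about 20.
stream = [10, 11, 9, 10, 10, 10, 11, 10, 9, 10, 20, 21, 19, 20, 20, 20, 21]
model, retrained_at = run_with_retraining(stream)
print(retrained_at, model.predict())
```

In a real deployment the threshold, the window and the choice between incremental updates and a full retrain all depend on your domain, which is exactly the tuning work described above.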
It is also important to appreciate that observing your business can change it as well. So you must always be looking at retraining your models as you use predictions that optimize your business. The process of optimizing your business (for the better hopefully) requires changes and updates to your machine learning models on a regular basis.
IoT + ETL + ML Models + Cloud = Optimizing Business
There are many considerations when beginning your Industrial IoT journey. There are no short-cuts, and the effort requires investing in and developing new skills and leveraging new technologies, but the journey will profoundly change your business for the better.
Posted by Sam Taha at 8:57 PM
Friday, April 1, 2016
Now having said that, let's be careful not to over-apply them and end up with a system that is fragile and takes an army (or perhaps a small army) of super smart operations people to deploy and run. I always go back to one of my favorite principles, "the law of parsimony", sometimes referred to as Occam's razor. It boils down to the reality that nature has a habit of looking for the simplest path to a problem. We can't always see that, though. Take for example nature's design of the common leafy tree. While it has a complex structure, this structure comes from some very simple principles.
I feel we are at a point with technology where we are advancing at a great pace, but in the process of doing this we are creating a lot of complexity. As Occam's razor states, complexity is a relative consideration among the alternatives, so the bar is always moving up with what we consider to be complex, but we sometimes need to step back and not use a bulldozer when a shovel will do the job just as well and will not break down on us when we need it most.
I feel that way with a lot of technology I see being applied. There are so many options to choose from that sometimes the simplest option for the problem is not selected. This might be human nature, "he who dies with the most toys wins", but this can be a costly mistake for many businesses.
I often joke that if you give me a bunch of plain old app servers (pick your favorite) and a relational database (pick your favorite), I can move the world (might need a load balancer in there somewhere ;). So, when you are in your next architecture planning meeting, ask yourself this question: can I bend my tools to my will, or do I need new toys to play with? :)
Posted by Sam Taha at 11:10 AM
Saturday, March 26, 2016
Have to agree with Google on this. Innovation in Machine Learning & Deep Learning combined with serverless cloud platforms will turn more and more data into actionable information by making these data science services and functions available to a wider audience.
This comment from the article says a lot about where we are and where we have to go:
We also see data scientists complaining that they spend up to 80% of their time preparing the data and training the models before they can even begin to extract any value out of the current machine learning technologies. In fact, some data scientists sarcastically call themselves “data janitors,” because they spend more time preparing data than they do analyzing it.
It is currently fairly complex from an IT perspective to construct the infrastructure and services needed for training and building learned models and to effectively ingest data at the scale and velocity needed to turn raw data into value. This is further complicated by the challenge of finding the skill sets needed to make this happen. The market and landscape, however, are evolving fast on all fronts, both on the IT front as described in the Google article and with more specialized IT skill sets becoming available. Stay tuned :)
Posted by Sam Taha at 8:31 AM
Monday, February 29, 2016
Essbase was also built when requirements on analytics, reporting and visualization were much more constrained and the expectations for realtime were not as demanding as they are now (not to mention data volumes). There are many organizations using Essbase for critical business functions, so streamlining the path to quicker decision making and more robust what-if type analysis is critical to being competitive and optimizing the operational performance of your business.
Oracle Essbase has a number of supporting tools for reporting and business intelligence that can provide business analysts and developers with access to visualizing and drilling down into the data within the cube. But with the evolution of Big Data and new modern analytical and visualization tools, wouldn't you like to get that data you have locked up in the Essbase cube to be made accessible to technologies such as Tableau for rich and rapid visualization or wouldn't you like to have your terabytes of cube data in Cassandra and available to Apache Spark for powerful access to big data style data ETL, machine learning and mashing and correlating with other data sources?
Well, while there is no easy out of the box solution to accomplish all of this, the dream to turn your Essbase cube into another data lake that is part of your Big Data ocean and more available for rich analytics and predictive modeling and visualization is achievable with a little work.
Let's start to describe how you can do this. The first step is getting your data out of Essbase, and it is probably the most difficult step. There are a number of ways to access data from Essbase. It starts with understanding what "information" you want to extract. You typically don't want to directly extract the raw data that is in the Essbase cube (but you could do that as well). Such data is often too granular (one of the reasons it is in a cube), so you might need to perform some aggregations (across the dimensions) and apply some business logic as you extract it. This is an ETL step that more or less denormalizes the data out of the cube, flattens it into a format that will be ideal for Tableau (further downstream in the process) and applies the necessary business logic to get the data into consumable information form. Tableau is ideal at consuming such "flattened" information given how it extracts dimensionality out of denormalized input information.
What is stored in Essbase dimensions and cells is typically detailed data elements (financial, sales, etc.) that might need some business transformation applied before extraction out of the cube. So this ETL process will prepare the data for ultimate consumption by Tableau. This is part of the art of the design, where you must understand what class of information you are after from the raw source data that is in the cube. It is part of the modeling exercise you go through and is critical to get correct in order for the data to be in a structure that can be visualized by Tableau.
Now for the actual mechanics of extracting data from Essbase: Essbase provides a few ways to get data out of the cube.
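As a rough illustration of the aggregate-and-flatten idea, here is a small sketch in plain Python. It does not touch Essbase at all: the "cube" is just a dictionary keyed by dimension members, and the dimension names (product, region, month), the month-to-quarter rollup and the numbers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, region, month) -> sales amount.
cube = {
    ("Widgets", "East", "Jan"): 100.0,
    ("Widgets", "East", "Feb"): 120.0,
    ("Widgets", "West", "Jan"): 80.0,
    ("Gadgets", "East", "Mar"): 50.0,
    ("Gadgets", "East", "Apr"): 70.0,
}

# Assumed business rule: roll months up to quarters during extraction.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}


def flatten(cube):
    """Aggregate along the time dimension, then emit one flat row per combination."""
    totals = defaultdict(float)
    for (product, region, month), amount in cube.items():
        totals[(product, region, MONTH_TO_QUARTER[month])] += amount
    return [
        {"product": product, "region": region, "quarter": quarter, "sales": total}
        for (product, region, quarter), total in sorted(totals.items())
    ]


rows = flatten(cube)
for row in rows:
    print(row)
```

Each output row carries its dimension values as ordinary columns, which is the denormalized shape a tool like Tableau can pick dimensions and measures from.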
The diagram above shows two options for extracting data from Essbase. Smart View is one option that leverages a spreadsheet approach for extracting, transforming and flattening data out of the cube in preparation for channeling it further downstream. While Smart View is not a pure programmatic API, the Excel spreadsheet capabilities allow for a lot of ad-hoc exploration and proof-of-concept work on getting data out of the cube, and what can be done with Smart View, the supporting Essbase APIs and Excel should not be underestimated.
The second option shown in the diagram is using the Essbase Java API. Using the Java API allows for directly querying the Essbase database and gives very dynamic and flexible access to the cube. This can be the most robust way to get at data in the cube but is the most development intensive.
One thing to note is that Smart View and the Java API are not mutually exclusive. Behind the scenes Smart View uses the Java API and functions as a middleman service that allows Excel to interface with Essbase. There is a Smart View server which exposes web services accessed by Smart View. The Smart View server (aka Analytics Provider Services, or APS for short) then uses the Essbase Java API to talk natively to the Essbase server.
The ultimate goal of this step (whether using Smart View or Java API), is to extract the cube data that we ultimately want to see in Tableau.
The next step is storing the data extracted in the first step. The goal here is to store the flattened data in Cassandra tables. This requires a custom loader app to take the flattened data and load it into Cassandra. A critical design decision to make up front is whether the load process will be purge and reload, time-series DW loading (fast-changing dimensional data) or change-data DW loading (slow-changing dimensional data). See diagram below.
Storing the data in Cassandra will set us up for the final stage of the process, which is creating the Tableau Data Extract that will deliver the final data processing stage. Note that in setting up the data for loading into Cassandra, Spark can be used to aid in the ETL process. One often overlooked feature of Apache Spark is that it is an excellent ETL tool. In fact, Spark deployment efforts often end up involving quite a bit of ETL work to prepare data for the final stage of modeling and machine learning processing. Apache Spark is a great tool for the emerging field of realtime ETL that is powering the next evolution of data movement in Big Data environments.
The next step in the process is using the Cassandra structured data in an environment where the Cassandra tables can be made visible to Tableau for realtime extraction and modeling. This is where Apache Spark comes into the picture. Normally, if you set up Cassandra as a direct data source for Tableau, you will have processing limitations because Cassandra can't perform the joins and aggregations needed by Tableau, so this work will happen on the Tableau client side. However, with Spark in the picture this processing can happen within the Spark cluster.
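The three load styles behave quite differently, so here is a hedged sketch of how I read them. It uses an in-memory dictionary as a stand-in for a Cassandra table keyed by its primary key; the function names, the row layout and the load dates are all assumptions for illustration, and a real loader would go through a Cassandra driver instead.

```python
def row_key(row):
    """Hypothetical primary key for a flattened row."""
    return (row["product"], row["quarter"])


def purge_and_reload(table, rows, key):
    """Purge and reload: throw everything away and load the new extract."""
    table.clear()
    for row in rows:
        table[key(row)] = row


def append_time_series(table, rows, key, load_time):
    """Time-series loading: keep every load by adding the load time to the key."""
    for row in rows:
        table[key(row) + (load_time,)] = dict(row, load_time=load_time)


def apply_changes(table, rows, key):
    """Change-data loading: write only rows that are new or different."""
    changed = 0
    for row in rows:
        if table.get(key(row)) != row:
            table[key(row)] = row
            changed += 1
    return changed


batch1 = [
    {"product": "Widgets", "quarter": "Q1", "sales": 220.0},
    {"product": "Gadgets", "quarter": "Q1", "sales": 50.0},
]
batch2 = [
    {"product": "Widgets", "quarter": "Q1", "sales": 220.0},  # unchanged
    {"product": "Gadgets", "quarter": "Q1", "sales": 55.0},   # restated
    {"product": "Gadgets", "quarter": "Q2", "sales": 70.0},   # new
]

current = {}
purge_and_reload(current, batch1, row_key)
changed = apply_changes(current, batch2, row_key)

history = {}
append_time_series(history, batch1, row_key, "2016-02-01")
append_time_series(history, batch2, row_key, "2016-03-01")
print(changed, len(current), len(history))
```

The trade-off is the usual one: purge and reload is the simplest to reason about, time-series loading preserves history at the cost of storage, and change-data loading minimizes writes but needs a reliable way to detect what changed.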
Here is a final picture of the major components in the workflow and processing flow:
While there are some pitfalls to be wary of, this is the case in any Big Data build out. And using products like Essbase and Tableau doesn't make the build out any easier. It would be nice to have fewer moving parts, but with a sound deployment and infrastructure this architecture can be made to scale out or to reliably support smaller footprint deployments.
Here are a couple of useful links that describe in more detail how the Spark, Cassandra and Tableau integration works:
With this architecture you get the scalability of Spark and Cassandra for both data processing and storage scale out. In addition, with this approach you avoid a common requirement with Tableau to create TDEs (Tableau Data Extracts) that are cached/stored on Tableau Server, a requirement that arises because source systems such as Essbase and even traditional RDBMS environments often don't scale to support Tableau Server/Desktop needs for realtime aggregations and transformations. Apache Spark steps in to provide the Big Data computational backbone needed to drive the Tableau realtime visualizations and modeling. While Tableau Server is great at serving the Tableau web UI and helping with some of the data governance (note this is an area it is improving in), its storage and processing capabilities are somewhat limiting.
To sum things up, Essbase cubes and related reporting services are not very scalable beasts, so this is where the combination of Cassandra and Spark can help out and give Tableau a better compute backbone that can drive interactive data visualization. Hope this information will inspire you to look at using Tableau with Essbase and help you ultimately unlock the potential of your Essbase data!
Tableau and Essbase can be a great combination for building rich reporting and dashboards without the overhead and complexity of traditional data warehousing and BI. Get your financial data out of Essbase and into Tableau and into the hands of your executives and decision makers. Contact Grand Logic to learn more.
Posted by Sam Taha at 1:28 PM
Tuesday, February 2, 2016
Note this is different than Spark Streaming and micro-batching. What we are talking about here is using Spark's traditional batch memory centric MapReduce functionality and powerful Scala (or Java/Python/R APIs) for low-latency and short duration interactive type processing via REST APIs integrated directly into application code.
The Spark processing API is very powerful and expressive for doing rich processing and the Spark compute engine is efficient at optimizing data processing and access to memory and workers/executors. Leveraging this in your interactive CRUD applications can be a boon for application developers. Spark makes this possible with a number of capabilities available to developers once you have tuned your Spark cluster for this type of computing scenario.
First, latency can be reduced by caching Spark contexts and even caching (when appropriate) RDDs. The open source Job Server project is a Spark-related project that allows you to manage a pool of Spark contexts, which essentially act as cached connections to a running Spark cluster. By leveraging Job Server's cached Spark contexts and REST API, application developers can access Spark with lower latency and enable multi-user access to shared resources and processing on the Spark cluster. Another interesting project that can be useful for interactive applications is Apache Toree - check it out as well.
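To give a feel for what calling Job Server from application code might look like, here is a small sketch. The `/jobs` endpoint with `appName`, `classPath`, `context` and `sync` query parameters follows the pattern shown in the spark-jobserver README as I recall it, so please verify it against the version you deploy. The host, port, app name, class path and context name below are all hypothetical, and `submit_job` is defined but not called because it needs a running server.

```python
import json
import urllib.request
from urllib.parse import urlencode


def job_url(base_url, app_name, class_path, context, sync=True):
    """Build the URL for running a job on an existing (cached) Spark context."""
    query = urlencode({
        "appName": app_name,      # name the job jar was uploaded under
        "classPath": class_path,  # fully qualified job class inside that jar
        "context": context,       # pre-created context to reuse, avoiding startup cost
        "sync": str(sync).lower(),
    })
    return f"{base_url}/jobs?{query}"


def submit_job(url, job_config=""):
    """POST the job request and return the parsed JSON response.
    Requires a reachable Job Server; shown for shape only."""
    request = urllib.request.Request(
        url, data=job_config.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Hypothetical names throughout.
url = job_url(
    "http://localhost:8090",
    app_name="sales-app",
    class_path="com.example.TopCustomers",
    context="shared-context",
)
print(url)
```

The important part is the `context` parameter: pointing every request at a long-lived context is what lets an interactive application skip the cost of starting a new Spark context per call.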
Secondly, you can set up a Standalone Spark cluster adjacent to your traditional application server cluster (a Tomcat servlet engine cluster, for example) that is optimized for handling concurrent application requests. Spark has a number of configuration options that allow a Spark cluster to be tuned for concurrent short duration job processing. This can be done by sharing Spark contexts as described, by using the Spark fair scheduler and by tuning RDD partition sizing for the given set of worker executors so that partition shuffling is kept to a minimum. You can learn more from this video presentation on optimizing Job Server for low-latency and shared concurrent processing.
Leveraging and tuning a multi-user friendly Spark cluster frees application developers to use Spark's powerful Scala, Java, Python and R APIs in ways not available to traditional application developers in the past. With this capability you can enhance traditional CRUD application development with low-latency MapReduce-style functionality to create applications developers could not have imagined before.
With this type of architecture, where your traditional application servers use an interactive low-latency Spark cluster via a REST API, you can integrate a variety of data sources and data/analytics services together using Spark. You can, for example, mash up data from your relational database and Cassandra or MongoDB to create processing and data mashups you could not easily do with hand-written application code. This approach opens up a bountiful world of powerful Spark APIs to application developers. Keep in mind, of course, that if your Spark operations require execution on a large set of workers/nodes and RDD partitions, this will likely not lead to very good response times. Any operation that has a reasonable number of stages and can be configured to run on RDDs with one or a few partitions has the potential to fit this scenario, but again this is something for you as the developer to quantify.
Running a Spark cluster tuned for servicing interactive CRUD applications is achievable and is one of the next frontiers that Spark is opening up for application developers. This will open the door for data integrations and no-ETL computing that were not feasible or imaginable in the past. Meshing data from multiple data stores and leveraging Spark's powerful processing APIs is now accessible to application developers and no longer only the realm of backend batch processing developers. Get started today. Stand up a Spark cluster, tune it for low-latency processing, set up Job Server and then create some amazing interactive services!
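As a starting point for that tuning, the sketch below renders a couple of settings in the whitespace-separated format used by `spark-defaults.conf`. `spark.scheduler.mode` and `spark.default.parallelism` are real Spark properties, but the values shown (and the idea that a small parallelism suits your workload) are assumptions you would need to validate on your own cluster.

```python
# Assumed starting values for a cluster serving many short, concurrent jobs.
settings = {
    # Share the cluster fairly between concurrent jobs instead of FIFO order.
    "spark.scheduler.mode": "FAIR",
    # Keep the default partition count small so short jobs spend less time shuffling.
    "spark.default.parallelism": "4",
}


def render_spark_defaults(settings):
    """Render settings as 'key   value' lines, one property per line."""
    width = max(len(key) for key in settings)
    return "\n".join(
        f"{key.ljust(width)}  {value}" for key, value in sorted(settings.items())
    )


conf_text = render_spark_defaults(settings)
print(conf_text)
```

Measure before and after any change: the right partition count depends on data size and executor cores, and fair scheduling only helps when jobs actually run concurrently inside the same Spark context.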
Posted by Sam Taha at 9:00 PM