Thursday, August 9, 2018

Choosing Between Spark ML, scikit-learn, and DNNs


These aren't the only considerations when deciding how to build your data science stack and the related tooling you will need around it, but they are where a lot of organizations tend to begin their questions. Sometimes the answer may be "all of the above." But you first have to reflect on your organization's goals and the level of your investment in any transformation effort, especially one that involves such a fundamental shift in how you turn data into business value.

There are a number of considerations that can influence your data science architecture and that should be examined before establishing your AI platform. They include:

  1. ETL and data prep tools? AI does not work without data. Find it, mine for it, and create it.
  2. Cloud, on-prem or hybrid for building your data science stack?
  3. How big is your data? Really, how big is your data? Not everyone has "big data".
  4. What are you modeling? What kind of outcomes are you looking to solve for?
  5. Build, buy, or partner? What kind of skills do you want to invest in for in-house data science, ML engineering, and ML operations?
The bullets above only touch on much deeper considerations that need to be assessed by any organization looking to transform its business with AI. But let's step back a bit and just discuss the question posed by the title of this post, to avoid turning it into a long, drawn-out analysis that goes down too many rabbit holes.

Spark ML
It is natural for a lot of organizations that have been doing "Big Data" to get their first exposure to data science through Spark's MLlib. Spark ML is a nice module/framework that comes with Spark and is packaged with most major Hadoop distributions. The ML APIs and algorithms include many of the popular model-building options, from decision trees, to survival analysis (time-to-event), to recommendation engines (ALS), to unsupervised learning with clustering and topic modeling. Spark ML is convenient for those coming from the Big Data universe. One nice advantage is that you can often leverage Spark's inherent distributed architecture to build models that operate at petabyte scale when needed. Is Spark ML ideal for all data sources and outcome objectives, and is it the most efficient option (you can hack DNNs into it if you have the stomach for it)? The answer, as you might guess, is no. Why is a topic for another day, but suffice it to say that it may not always be the most accurate way to build models and may not always give you the best bang for your CPU/GPU buck.
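To give a flavor of the API, here is a minimal sketch of training an ALS recommender with PySpark. The column names and the ratings.csv path are assumptions made purely for illustration, not references to any particular dataset.

# Minimal PySpark ALS sketch; file path and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-example").getOrCreate()

# Assumed schema: userId, movieId, rating
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# Collaborative filtering with ALS, as mentioned above
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(train)

# Score the held-out split and peek at a few predictions
predictions = model.transform(test)
predictions.show(5)

Because the DataFrame is partitioned across the cluster, the same code runs on a laptop-sized sample or on terabytes of ratings without changing the modeling logic.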

scikit-learn
Then there is good old scikit-learn. Any Python developer with a math or data background who has done statistical modeling (or ML work) will know, and likely love, scikit-learn and the other related Python packages such as NumPy, SciPy, and pandas, to name the most popular. scikit-learn is a treasure trove of algorithms and APIs, and an awesome framework for ML developers and data scientists. Does it scale in the same ways that Spark can? Unfortunately, no. But do you always really need it to? Look at your data before you answer that.
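For comparison, here is a minimal sketch of the typical scikit-learn workflow: fit, predict, score. The bundled iris dataset and the random forest are arbitrary illustrative choices.

# Minimal scikit-learn sketch; dataset and model choice are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The familiar estimator API: construct, fit, predict
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

Everything here runs in memory on a single machine, which is exactly the trade-off: simple, rich APIs in exchange for single-node scale.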

Deep Neural Nets
Then there are the new kids on the block (back from the future): DNNs. TensorFlow and PyTorch, to name just a couple of the most popular frameworks, are claimed to be universal function approximators that can model anything and solve for everything. Note: you will need to bring data, and lots of it. They are data hungry. They can handle anything from classification, to generating word embeddings, to building generative models. There isn't much a DNN and its offshoots can't do, theoretically. Through their natural fit with GPUs they can scale fairly efficiently, and you can sometimes distribute them, with some extra heavy lifting.
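As a rough sketch of what a small DNN looks like in code, here is a tiny feed-forward classifier using Keras on the TensorFlow backend. The MNIST dataset, layer sizes, and training settings are illustrative assumptions, not recommendations.

# Minimal Keras/TensorFlow sketch; dataset and layer sizes are illustrative.
import tensorflow as tf

# MNIST ships with Keras, so it makes a convenient stand-in dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

The same few lines run on a CPU or a GPU; the framework handles the device placement, which is a big part of why DNN training scales so naturally onto GPU hardware.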

Taking your Models Live
A lot of what we just reviewed is about building and training models. Now, how do you take what we just trained and turn it into a service that predicts, classifies, or generates data? That is also a topic unto itself. Operationalizing machine learning models can be non-trivial, but at times it can also be fairly straightforward; it just depends on the model you are creating. For example, a discrete, bounded model can sometimes just be exported into a database, but often the solution (the input and output space) is not finite and requires deploying your trained models as distributed inference engines, and that is a bit more work. Then there is the nagging issue of how and when to update your models. Again, another subject altogether.
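As a simple illustration of the inference-engine idea, here is a hedged sketch of wrapping a saved scikit-learn model in a small Flask service. The model.joblib path, the /predict route, and the payload format are assumptions for demonstration only, not a production design.

# Minimal model-serving sketch; file path, route, and payload are illustrative.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")  # load the trained model once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Expecting JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)

A single process like this is fine for a demo; the harder operational work starts when you need load balancing, versioned model rollouts, and monitoring around that endpoint.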

Buy or Build
So should you build or buy? The big cloud providers (Google, AWS, Azure) are all making a lot of what we just described available as MLaaS offerings (to varying degrees of completeness). So stay tuned and stay current, as the AI technology world is changing fast.