Cloud Analytics & ML with Sam Taha

Thursday, June 11, 2020

Is ML Curve Fitting The Best We Got?

Curve fitting is, for the most part, what most machine learning boils down to, not that that is a bad thing. How do we go beyond the correlation of the black box? I see the rediscovery of symbolic AI and the introduction of causality into purely probabilistic ML as analogous to what happened in software decades ago, when we evolved from assembler and procedural languages and started to model software and data as richer abstractions with relationships. Not the same thing, but a similar evolution in engineering and computer science.

Causal relationships exist in the world and can influence how we collect our data and engineer the features that drive our ML model training. This includes everything from how we analyze covariance in the data to how we manage and monitor data distributions. Collecting data and engineering features is not enough. Causal relationships can sometimes be gleaned from the data we observe, but oftentimes we must look at how we can develop experiments and interventions, with A/B test strategies and multi-armed bandit processes, to uncover the causality in order to better train our models.
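To make the gap between observed correlation and an experiment concrete, here is a minimal sketch in Python. Everything in it is a hypothetical toy setup of my own choosing: the hidden confounder `z`, the true treatment effect of 1.0, the probabilities, and the function name `simulate`. It only illustrates why a randomized A/B assignment can recover an effect that a naive comparison on observational data gets wrong.

```python
import random

def simulate(n=20000, randomize=False, seed=0):
    """Return the naive estimate of a treatment effect (treated mean minus control mean).

    Hypothetical setup: a hidden confounder z raises both the chance of being
    treated and the outcome itself. The true causal effect of treatment is 1.0.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.random() < 0.5                      # hidden confounder
        if randomize:
            t = rng.random() < 0.5                  # A/B test: a coin flip assigns treatment
        else:
            t = rng.random() < (0.8 if z else 0.2)  # observational: z drives who gets treated
        y = 1.0 * t + 2.0 * z + rng.gauss(0, 1)     # outcome depends on both t and z
        (treated if t else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)

observational = simulate(randomize=False)
experimental = simulate(randomize=True)
print(f"observational estimate: {observational:.2f}")
print(f"randomized estimate:    {experimental:.2f}")
```

Under these assumed numbers, the observational estimate should land near 2.2, because the confounder inflates it, while the randomized estimate should land near the true effect of 1.0. A multi-armed bandit takes the same idea further by shifting traffic toward the better arm while the experiment is still running.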

Interventions and experiments can help us answer some "what if" questions. Then you have counterfactuals, which are beyond the reach of most experiments. Yet understanding causal relationships has the potential to offer us insights and help businesses make better sense of the world and their opportunities. We need better tools and engineering processes to incorporate these skills into our ML frameworks and ML processes.

This is starting to happen in AI and ML today across disciplines that are applying ML. This is a good article on the topic that I suggest all ML engineers and data scientists read.

Friday, June 5, 2020

Choosing an ML Cloud Platform: GCP vs AWS

ML cloud services are evolving fast and furious. GCP and AWS are the leading players. Here is a quick visual peek at both ML tech stacks.

AWS has SageMaker as the centerpiece:

Then there is GCP with its Kubeflow angle and on-premises hybrid cloud options:

Tuesday, June 2, 2020

Cloud OLAP: Choosing between Redshift, Snowflake, BigQuery or other?

Which should you choose for your cloud OLAP engine? There are a lot of choices when it comes to cloud-based analytics engines. All the major clouds have their homemade solution (GCP/BigQuery, AWS/Redshift, Azure), and there are plenty of independent options, Snowflake and Databricks to mention a few.

Which is right for your business, and in what situation? Needs can vary from internal data exploration to driving downstream analytics with tight SLAs. I am a strong proponent of the approach that, no matter what you do, you have to start with a foundational data lake blueprint. You then choose to build on that with either an open source analytics engine on top of your cloud data lake or a licensed commercial analytics engine such as Redshift, Snowflake or BigQuery.

There is no one answer without looking at your business needs, existing technical foundation and strategic direction, but I have to say I am getting more impressed with Snowflake as the product matures. Without getting too deep into the details, Snowflake is sort of an in-memory data lake (backed by public cloud object storage) with a highly elastic in-memory MPP layer. There are many pros and cons in selecting the best option for your business. The edge Snowflake has is that it is cloud agnostic (sort of the Anthos of the data cloud), and I really like their recently released cross-cloud and data center replication feature and their cross-cloud management.

If you want to discuss how to approach this decision, look me up!