Thursday, April 15, 2021

Data Driven vs Data Model Driven Company

Somehow along the way data lakes got the rap that you can dump "anything" into them. I think this is carryover from the failed hippie free-data-love days of Hadoop and HDFS. No, a data lake is not a place where you dump any kind of JSON, text, XML, or log data, crawl it with some magic schema crawler, then rinse and repeat. Sure, you can take the approach of consuming raw sources and then crawling them to catalog their structure, but that is a narrow case you do NOT do in a thoughtless way. In many cases you don't need a crawler at all.

Now, with most data lakes you do want to consume data in raw form (ELT, more or less), but this does not mean you just dump anything. You still must have expectations on structure and schema contracts with the source systems you integrate with, including dealing with schema evolution and partition planning. Formats like Avro, Parquet and ORC are there to transform your data into normalized and ultimately well curated (and DQ-ed) data models. Just because you have a "raw" zone in your data lake does not mean your entire data lake is a dumping ground for data of any type, or that your data source structures can just change at random.
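
As a rough illustration of what a schema contract plus partition plan can look like at the raw-zone boundary, here is a minimal Python sketch using pyarrow; the field names, bucket path and partition column are assumptions for illustration, not a prescription.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# Expected schema contract for the source feed -- anything that does not
# conform should be rejected or quarantined, not silently "dumped".
schema = pa.schema([
    ("event_id", pa.string()),
    ("event_ts", pa.timestamp("us")),
    ("event_date", pa.string()),   # partition column agreed with the source
    ("payload", pa.string()),
])

batch = pa.table(
    {
        "event_id": ["e1", "e2"],
        "event_ts": [datetime(2021, 4, 15, 12, 0), datetime(2021, 4, 15, 12, 1)],
        "event_date": ["2021-04-15", "2021-04-15"],
        "payload": ['{"k": 1}', '{"k": 2}'],
    },
    schema=schema,
)

# Land it in the raw zone partitioned by date so downstream jobs can prune.
pq.write_to_dataset(batch, root_path="s3://my-lake/raw/events",
                    partition_cols=["event_date"])
```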

Miracles required? This is what most of today's strategic AI and even BI/Analytics engineering and planning looks like. If you don't have your data modeled well and your data orchestration modularized and under control, then achieving the promise of cost-effective and maintainable ML models and self-service BI is a leap of faith at best. Forget about being a data-driven company if you are not yet a data-model-driven company.

A data lake is a modern DW built on highly scalable cloud storage and compute and based on open data formats and open federated query engines. You can't escape the need for well-thought-out and curated data models. It does not matter whether you are using Parquet and S3 or Snowflake and Redshift. Data models are what make BI and analytics function.


Thursday, January 21, 2021

The AI Lesson for All of Us


There is no doubt that the brute-force ML (aka deep learning) approach to achieving general AI, or some level of human decision making, by using more and more compute and data has been successful over the past decade.

I am fond of believing that there is more to AI than optimizing an objective function with more data and better hyperparameters - for example, integrating symbolic AI, knowledge graphs, causality...etc. However, trying to build systems to think the way we think we think may not be the future of AI, at least not yet.

There is likely something beyond just bigger deep learning models - maybe it is software program synthesis or other genetically founded approaches - no one knows, as there is not enough research in these areas yet. But some form of AI is already here: self-driving cars already construct 3D world models and utilize hand-crafted rules mixed with deep learning analysis of sensor data to give us the perception that AI decision making is going on. Efficiency also matters as we get into bigger and bigger models with billions of parameters. It is no joke how much energy (compute resources) the training of many of these models (e.g. GPT-3) requires. It is also important to separate the hype from the value (companies selling us on autonomous cars vs. useful ML driver assistance); companies use the AI hype to raise more capital, but the reality is not aligned with the capabilities of generalized AI, at least in this current age of AI.

ML algorithms from the likes of YouTube and Facebook already manipulate our digital lives and behaviors with the massive data they collect about us. Maybe AI is already here and in control and we are just the data simulation generating more data for our AI overlords :) Anyway, my main point in sharing Sutton's post (The Bitter Lesson) is to make us think about the data we control in the business and enterprise world. Curating our data, and more of it, is what will continue to drive ML and AI for the foreseeable future. So make sure to get your data quality and your data lakehouse BI/analytics in order ;)

Wednesday, December 16, 2020

2021 Data and Analytics Predictions

Cloud data platforms really gained momentum in 2020. It has been a real breakout year for both cloud data lakes and cloud data warehouses (yeah, I am making a distinction). Cloud data warehouses started several years ago with Redshift and the first iteration of BigQuery. Databricks, AWS, Presto and others re-established the data lake in the cloud and made it very SQL friendly. Redshift and BigQuery have improved and now make it possible and easier to query external data lake storage directly (partitioned Parquet, Avro, CSV...etc) and have started to blend data lakes with data warehouses (somewhat). And to top it off this year, Snowflake put a massive stamp on everything with its financial market boom and accelerating adoption.

But we are still in the early days of the cloud data platform journey. We've got a ways to go. Even with the cloud, many of the solutions mentioned, along with others, still lock you into their proprietary data walled gardens. In 2021 we will begin to see the next evolution of cloud data lakes/warehouses. It is not enough to separate compute from storage and just leverage the endless sea of elastic cloud and object storage. While this is an important step forward for data and analytics platforms, we need to go still further. We need to separate the query engine itself from the data and storage. This is the next step, and it will be guided in part by leveraging data virtualization and establishing the physical storage structure of the data itself upon open standards.

Data virtualization will gain more traction (especially in the cloud) and begin to eclipse and encompass data warehousing, in particular for low-latency BI and analytics where it already plays a big role. Minimizing data copying in your data lake/warehouse is important, especially for the semantic and BI layers in your lake, which often demand highly curated and optimized models.

The key building blocks will include an open data lake foundation combined with data federation and high performance virtualization query engines coupled with cloud storage. And all on open standards. Think Apache Iceberg, Apache Hudi, Delta Lake, Apache Arrow, Project Nessie and other emerging open and cloud optimized big data standards.
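
To make the "open storage, pluggable engine" idea concrete, here is a hedged sketch using pyarrow (the Python bindings for Apache Arrow) to scan the same partitioned Parquet files that any other engine could also query; the lake path and column names are hypothetical.

```python
import pyarrow.dataset as ds

# Point at open-format files on object storage -- no proprietary warehouse
# owns this data, so any Arrow/Parquet-aware engine can read it.
events = ds.dataset(
    "s3://my-lake/raw/events",   # hypothetical lake path
    format="parquet",
    partitioning="hive",         # event_date=... directory layout
)

# Push a partition filter down to storage and materialize only what we need.
table = events.to_table(filter=ds.field("event_date") == "2021-04-15")
print(table.num_rows)
```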

Solutions such as Snowflake, Redshift, BigQuery, and Databricks are still potential pluggable building blocks, but they should not be confused with the sole foundation or centerpiece of your cloud data platform; otherwise you will be walling yourself off all over again with another Teradata, Netezza or Oracle, just this time in the cloud.

Thursday, November 26, 2020

Are Open Cloud Data Lakes the Future?


Building a cloud data platform? First question: open data lake, proprietary DW, or maybe a mix of both? Not a simple question or architecture decision to make given the flood of solutions and players in the space, from the large cloud platforms to new entrants such as Snowflake.

I see the Fivetran argument from George Fraser that decoupled storage/compute cloud MPP DW engines such as Snowflake are the way to go. On the flip side, I also see Dremio's Tomer Shiran's argument that an open data lake on open data storage standards (Apache Parquet & Arrow) along with data virtualization is the way to go.

What is the right answer? Well, as with most things in engineering and technology, there is no one-size-fits-all. I do believe that data virtualization in the cloud along with cloud storage has been a game changer. Presto paved the way by demonstrating that data and query federation is possible, especially in a cloud environment. While HDFS/Hadoop largely fizzled for reasons I won't get into here, Parquet, Arrow and other Apache projects have taken off and brought us the modern data lake. Big data for both compute and storage has proved its scale and manageability in the cloud.

How much of your data to keep in a proprietary cloud DW vs an open cloud data lake is an important decision. There is a balance that does not lock you in totally and at the same time lets you use the best technology of the day while managing costs. Be wise.

Wednesday, August 12, 2020

The Lost Art of Data Lineage


Maybe it is more of a mangled and ill-defined art than a lost art. Data lineage is one of the aspects of data governance that gets lost in the shuffle of data analytics and data warehousing/lake projects. It is vital for many reasons, not least of them compliance and auditing. Often times data lineage never makes it onto the train to the final destination when building large-scale data warehousing and analytics solutions.

Part of the problem is that it gets turned into an all-or-nothing effort, leading to very little getting delivered relating to data lineage, or it gets turned into a mishmash of concepts and solution features. Often it gets dumped under auditing and logging, irrespective of business or technical metadata and real-time vs historical, and then forgotten.

Let's quickly break down what data lineage actually entails. There are three dimensions of data lineage to consider as part of a general data governance strategy. These include:

1) Logical Data Processing Flows (logical and/or visual DAG representation)
Defining the high-level visual graph and code-module-level relationships between the data processing stages and steps that produce and generate data models.

2) Metadata Relationship Management (low level data relationships and logic/code)
Tracking metadata relationships between data models (schemas/tables/columns) and the related source code used in the transformations. This includes showing the transformation logic/code used going from one or multiple source data models to a target data model.

3) Physical Data Processing History (what happened, is happening, and going to happen)
At a data set level (sets of records, rows and columns), this shows the record-level and data-set-level linkage between data sets, for processing that has happened in the past or will happen in the future. There is a temporal (real-time and historical) aspect to this, showing transaction events from one or multiple data sets feeding a downstream data set and the transactional breadcrumbs and events involved.

Note that the term data model denotes a static structure (more or less your schema/table definitions), while data sets are live physical structures (the actual data at a record level) spanning one or multiple data models.
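
As a loose illustration, here is a minimal Python sketch of what a single lineage event record might capture across the three dimensions above; all field names are hypothetical and would vary with your tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class LineageEvent:
    run_id: str                 # physical processing history (dimension 3)
    job_name: str               # logical flow step (dimension 1)
    started_at: datetime
    finished_at: datetime
    inputs: List[str]           # upstream data sets consumed
    outputs: List[str]          # downstream data sets produced
    transform_ref: str = ""     # metadata relationship (dimension 2): code/commit ref
    record_count: int = 0

event = LineageEvent(
    run_id="run-42",
    job_name="curate_events",
    started_at=datetime(2020, 8, 12, 2, 0),
    finished_at=datetime(2020, 8, 12, 2, 7),
    inputs=["raw.events/event_date=2020-08-12"],
    outputs=["curated.events"],
    transform_ref="git:abc123:jobs/curate_events.py",
    record_count=1_250_000,
)
```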

Before you get started on your data lineage journey you need to decide to what extent you will implement one or all of these dimensions of data lineage as part of your overall data governance strategy. There are varying degrees of exactness and completeness to each one of them as well. And make sure to keep them distinct.

No one commercial tool will do the complete job. It is usually a combination of multiple tools, hand-stitched software services, and best practices/conventions that will be necessary to do the job well, depending on your criteria for success.

Thursday, July 2, 2020

Tuning the Snowflake Data Cloud


To be clear, I do not classify Snowflake as merely an OLAP or MPP database. It has these capabilities for sure, but being born in the cloud and only for the cloud, it has much more to offer. I consider it a "data fabric". Yeah, that is a broad term, but Snowflake is really what Big Data and Hadoop were aspiring to achieve but never did, for reasons I won't get into here.

What makes Snowflake a game changer for OLAP engines and data warehousing? The features listed in the diagram below are all true and not just marketing spin. How can Snowflake accomplish this? Built in the cloud and only for the cloud - what does that really mean? Snowflake takes full advantage of two key superpowers only available in the cloud: 1) elastic and virtually limitless, highly durable, immutable storage, and 2) the ability to spin up virtually limitless compute. It starts with these two things, and there is a lot more in Snowflake to deliver a full-package solution.



If Snowflake has no developer/DBA-configurable indexes, partitioning, distribution keys, vacuuming, stats tuning, storage tuning...etc, like other MPP/OLAP engines, is there anything I can really tune (be careful with auto re-clustering)? With great power comes great responsibility. There are multiple things you can do to tune and optimize for performance. This means being careful to monitor and manage costs, because it can be too easy to scale up and out, and that will cost you. From a schema modeling/design perspective there are some optimizations you can do to minimize compute scale-up/out requirements (and thus costs). One of them is using cluster/sort keys, one of the few DDL things you can tune at the metadata level. How you use materialized views and how you manage joins vs de-normalization are also important considerations. All of these things are highly dependent on downstream consumption/usage patterns. So yes, you still need good data engineers and architects :)
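
A hedged sketch of what that tuning can look like in practice, using the Snowflake Python connector to set a clustering key and define a materialized view; the account, table and column names are placeholders, and whether these options make sense depends entirely on your consumption patterns (and, for materialized views, your Snowflake edition).

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="XSMALL_WH", database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()

# Clustering key: one of the few DDL-level knobs; pick columns that match
# downstream filter/join patterns, and watch auto re-clustering credits.
cur.execute("ALTER TABLE fact_orders CLUSTER BY (order_date, region)")

# Materialized view to pre-aggregate a hot consumption pattern instead of
# scaling the warehouse up/out for every dashboard refresh.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS mv_daily_orders AS
    SELECT order_date, region, SUM(amount) AS revenue
    FROM fact_orders
    GROUP BY order_date, region
""")

cur.close()
conn.close()
```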

Thursday, June 11, 2020

Is ML Curve Fitting The Best We Got?


Curve fitting is for the most part what most machine learning boils down to, and that is not necessarily a bad thing. But how do we go beyond the correlations of the black box? I see the rediscovery of symbolic AI and the introduction of causality into purely probabilistic ML as analogous to what happened in software decades ago, when we evolved from assembler and procedural languages and started to model software/data as richer abstractions with relationships. Not the same thing, but a similar evolution in engineering and computer science.

Causal relationships exist in the world and can influence how we collect our data and engineer the features that drive our ML model training. This includes everything from how we analyze covariance in the data to how we manage and monitor data distributions. Collecting data and engineering features is not enough. Causal relationships can sometimes be gleaned from the data we observe, but often we must design experiments and interventions with A/B test strategies and multi-armed bandit processes to uncover the causality and better train our models.
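
As a toy illustration of intervention-driven data collection, here is a minimal epsilon-greedy multi-armed bandit sketch in Python; the variants and their hidden conversion rates are invented for the example.

```python
import random

arms = {"variant_a": 0.05, "variant_b": 0.08}   # hidden conversion rates (unknown in practice)
counts = {arm: 0 for arm in arms}
values = {arm: 0.0 for arm in arms}
epsilon = 0.1

def choose_arm():
    if random.random() < epsilon:
        return random.choice(list(arms))          # explore
    return max(values, key=values.get)            # exploit the current best estimate

for _ in range(10_000):
    arm = choose_arm()                            # intervention: we assign the variant
    reward = 1.0 if random.random() < arms[arm] else 0.0
    counts[arm] += 1
    # incremental mean update of the observed reward for this arm
    values[arm] += (reward - values[arm]) / counts[arm]

print(counts, values)
```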

Interventions and experiments can help us answer some "what if" questions, and then there are counterfactuals, which are beyond the reach of most experiments. Yet understanding causal relationships has the potential to offer us insights and help businesses make better sense of the world and their opportunities. We need better tools and engineering processes to incorporate these skills into our ML frameworks and ML processes.

This is starting to happen in AI and ML today across the disciplines that are applying ML. This is a good article on the topic that I suggest all ML engineers and data scientists read.


Friday, June 5, 2020

Choosing an ML Cloud Platform: GCP vs AWS

ML cloud services are evolving fast and furious. GCP and AWS are the leading players. Here is a quick visual peek at both ML tech stacks.

AWS has SageMaker as the centerpiece:


Then there is GCP with its Kubeflow angle and on-premises hybrid cloud options:


Tuesday, June 2, 2020

Cloud OLAP: Choosing between Redshift, Snowflake, BigQuery or other?


Which to choose for your cloud OLAP engine? There are a lot of choices when it comes to cloud-based analytics engines. All the major clouds have their homemade solutions (GCP/BigQuery, AWS/Redshift, Azure) and there are plenty of independent options, from Snowflake to Databricks, to mention a few.

Which is right for your business and in what situation? Needs can vary from internal data exploration to driving downstream analytics with tight SLAs. I am a strong proponent of the approach that, no matter what you do, you start with a foundational data lake blueprint and then choose to build on it with either an open source analytics engine on top of your cloud data lake or a licensed commercial analytics engine such as Redshift, Snowflake or BigQuery.

There is no one answer without looking at your business needs, existing technical foundation and strategic direction, but I have to say I am getting more impressed with Snowflake as the product matures. Without getting too deep into the details, Snowflake is sort of an in-memory data lake (backed by public cloud object storage) with a highly elastic MPP compute layer. There are many pros and cons in selecting the best option for your business. The edge Snowflake has is that it is cloud agnostic (sort of the Anthos of the data cloud), and I really like its recently released cross-cloud and data center replication feature and its cross-cloud management.

If you want to discuss how to approach making this decision process look me up!

Friday, January 24, 2020

Why Spark is the Wrong Abstraction


Is the sun setting on Spark? I don't want to knock Spark and frameworks like it; they have had their moment in the sun. Spark was a reasonable and important successor to Map/Reduce & HDFS/Hadoop, but its time has come to be exiled to the fringes of the big data ecosystem and used only when absolutely necessary. Spark still has usefulness for some specialized ETL and data processing applications, but overall Spark can be massive overkill and a burden to program and operate (expensive too). In many cases it is inefficient in both development and troubleshooting, and the overhead of infrastructure management can be expensive relative to other options.

Not Everything is a Nail

I see many projects using Spark (and tools related to it, such as aws emr for example) for transforming and moving data into and out of data lakes when often simpler tools could be used. Spark is a pretty big and complex sledgehammer, and many problems can often be solved with more effective tooling. The ever-growing ubiquity of serverless technology, especially database- and analytics-capable services, has presented engineers with many more options, and it is time to dial back when it is most appropriate to bring out the Spark sledgehammer.

In a lot of cases with Spark development, you end up writing a one-off database engine for your specific ETL scenario, and using Spark's distributed compute and storage abstractions and DAG engine makes that convenient. While it is possible to use Spark as a database engine of sorts, the reality is that databases are better at optimization and at using the available compute/storage resources. And this is especially the case with serverless technologies that support SQL as a first-class citizen. For Spark, SQL really is bolted on.

The only big challenge is that most database platforms are not designed for the cloud or for elastic compute/storage like Spark sort of is. I say sort of, because Spark leaves too much responsibility on the developer/DevOps side to make data/compute optimization decisions and infrastructure decisions, which is something databases are intrinsically good at.

Declarative vs Imperative Analytics

Now, there are serverless Spark services as well (like aws glue and other managed Spark services), but given the general-purpose nature of Spark, this still leaves optimization and resource allocation a challenge for developers. What are the alternatives? I really like Presto, and in particular the serverless aws flavor of it (Athena), as well as services like BigQuery. Tools like these are the future of big data ETL and analytics. Spark can still be useful in heavy data transformation scenarios and complex feature engineering, but not as a general data analytics and data movement engine. Streaming is one specialized area where solutions like Spark can still play, but there are many other solutions better designed out of the box for streaming and cloud scale-out. Spark has in many respects tried to be all things to all people. It has continuously expanded its support for SQL semantics and has incorporated APIs for streaming...etc. This has made Spark a versatile framework and API for developers, but as a general-purpose ETL and data analytics engine, I think there are now better options.
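
To make the contrast concrete, here is a hedged sketch of the same daily rollup expressed imperatively with PySpark DataFrames and declaratively as a CTAS query handed to Athena via boto3; the table, column and S3 path names are assumptions.

```python
# Imperative: PySpark DataFrame transformations -- you own the cluster sizing,
# shuffle behavior, and output layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_rollup").getOrCreate()
daily = (
    spark.read.parquet("s3://my-lake/raw/sales/")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)
daily.write.mode("overwrite").parquet("s3://my-lake/curated/daily_revenue/")

# Declarative: the equivalent SQL handed to a serverless engine (Athena);
# the engine decides how to parallelize and optimize the scan.
import boto3

athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString="""
        CREATE TABLE curated.daily_revenue
        WITH (format = 'PARQUET',
              external_location = 's3://my-lake/curated/daily_revenue/')
        AS SELECT order_date, SUM(amount) AS revenue
        FROM raw.sales
        GROUP BY order_date
    """,
    ResultConfiguration={"OutputLocation": "s3://my-lake/athena-results/"},
)
```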

While the sun may not completely set on Spark, and tools like it, the declarative power of SQL will win in the end over the imperative programming model of Spark. This has been proven time and time again in the database and analytics tech space.

Tuesday, January 21, 2020

Data Lakes before AI/ML/Analytics (cart before horse thing)


Don't start or continue your AI and predictive analytics journey without building the necessary data infrastructure underpinnings. And that starts, first and foremost, with building a cloud data lake that is designed to meet the data- and compute-hungry needs of your AI/ML and analytics workloads. Why build a cloud data lake first?

1) Economics
2) Elastic compute
3) Elastic Storage
4) Storing (almost) everything
5) ML Model engineering
6) Feeding downstream analytics
7) Feeding downstream operational data stores
8) Data exploration, experimentation and discovery

A cloud data lake makes all the above possible at scale.

Building a cloud data lake securely and in an architecturally effective manner is achievable and will make your downstream AI/ML/Analytics journey attainable and long-term sustainable. Don't start your journey without this foundation.



Monday, September 9, 2019

Know Where Your Data Lake Has Been?



The foundation of a good data management strategy is based on a number of skills, including data policies, best practices, and technology/tools, as shown in the diagram above. The most commonly discussed term of the bunch, from the ones listed in the diagram, is Data Governance. This term is tossed around a lot when describing the organizational best practices and technical processes needed to have a sound data practice. Data governance can often be ambiguous and all-encompassing, and in many cases what exists in organizations falls short of what is needed in our modern big data and data lake oriented world, where ever-increasing volumes of data are playing an ever more critical role in everything a business does.

What is often left out and missing in many modern data lake and data warehousing solutions are the two lesser-known cornerstones I show in the diagram: Data Lineage and Data Provenance. Without these additional pieces, your data lake (with its ever-increasing volume and variety of data) can quickly become an unmanageable data dumping ground.

Who needs Data Lineage and Data Provenance management tools, APIs and visualization services?
  • Data Engineers (building and managing data pipelines)
  • Data Scientists (discovering data & understanding relationships)
  • Business/System Analysts (data stewardship)
  • Data Lake / BI Executives (bird's eye view of health and sources/destinations)
  • Data Ops (managing scale and infrastructure)
  • Workflow/ETL Process Operators (monitoring & troubleshooting)
Most of the ETL processing and dataflow orchestration tools out there in the market (open source and commercial), such as NiFi, Airflow, Informatica, and Talend among others, do not directly address this gap. What is the gap? The gap is knowing where your data is coming from, where it has been, and where it is going. Put another way, having visibility into the journey your data takes through your data lake and overall data fabric. And doing this in a lightweight fashion without a lot of complex and expensive commercial tools.

Let's spend a bit of time talking about data lineage and data provenance in particular and why they are important parts of a modern and healthy overall data architecture. First, let's touch on the broader data governance ecosystem.

Data Governance

Data Governance can be an overused term in the industry and is sometimes all-encompassing when describing the data strategy, best practices and services in an organization. You can read a lot of differing definitions of what data governance is and is not. I take a simple and broad definition of data governance, which includes:
  1. Data Stewardship - org policies for access/permissions/rights
  2. Data Map - location of data
  3. MDM - identifying and curating key entities
  4. Common Definitions - tagging and common terminology
  5. Taxonomy/Ontology - the relationships between data elements 
  6. Data Quality - accuracy of data
  7. Compliance - HIPAA, GDPR, PCI DSS...
These are all important concepts, yet they do not address the dynamic nature of data, the ever more complex journey data takes through modern data lakes, and the relationships between the data models residing in your data lake. This relates to needing to know where your data is going and where it came from. This is where data lineage and data provenance come into play to complement data governance and allow data engineers and analysts to wrangle and track the run-time dynamics of data as it moves through your systems and gets combined and transformed on its journey to its many destinations.


Data Control Plane == Data Lineage and Data Provenance

I view data lineage and data provenance as two sides of the same coin. You can't do one well without the other. Just like digital networks have control planes for visualizing data traffic, our data lakes need a Data Control Plane. And one that is independent of whatever ETL technology stack you are using.

In our modern technical age, data volumes are ever increasing and data is being mashed and integrated together from all corners of an organization. Managing this at both a business level and a technical level is a challenge. A lot of ETL tools exist today that allow you to model your data transformation processes and build data lakes and data warehouses, but they all consistently fall short of giving you a comprehensive, and I would say orthogonal and independent, view of how your data is interconnected: where your data came from, where it is right now, and where it is going (and where it is expected to go next) in its journey through your data fabric and destinations.

Many modern ETL tools let you build visual models and provide some degree of monitoring and tracking, but these features are proprietary and can't be separated from the ETL tools themselves, which creates lock-in and does not allow one to mix and match the best-of-breed ETL tools (Airflow, aws step functions, lambda, Glue, Spark, EMR...etc) that are now prevalent in the cloud. If you are using multiple tools and cloud solutions, it gets ever more complicated to have a holistic view of your data and its journey through your platform.

This is why I strongly believe that data lineage and data provenance should be completely independent of whatever underlying ETL tooling and data processing technology you are using. If they are not, then you are both locking yourself in unnecessarily and greatly limiting the potential of your data ops teams and data engineers, as well as your overall management of the data and the processes carrying it through its journey in your data lake.
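
As a rough sketch of what tool-independent lineage emission could look like, here is a small Python decorator that wraps any task (an Airflow callable, a Glue job entry point, a plain function) and posts a lineage event to a standalone service; the endpoint and payload shape are hypothetical.

```python
import functools
import time
import requests

LINEAGE_ENDPOINT = "https://lineage.internal/api/events"   # hypothetical service

def track_lineage(job_name, inputs, outputs):
    """Emit a lineage event around any task, regardless of the ETL tool running it."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.time()
            result = fn(*args, **kwargs)
            requests.post(LINEAGE_ENDPOINT, json={
                "job": job_name,
                "inputs": inputs,
                "outputs": outputs,
                "started_at": started,
                "finished_at": time.time(),
            }, timeout=5)
            return result
        return wrapper
    return decorator

@track_lineage("curate_events", inputs=["raw.events"], outputs=["curated.events"])
def curate_events():
    ...  # the actual transformation runs here, in whatever tool you choose
```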

Data provenance and data lineage are not just fancy words; they are a framework and tool set for managing your data and having a historical audit trail, real-time tracing/logging, and a control-plane graph of where your data is going and how it is interconnected.

So do not build your data lake without the benefits of a Data Control Plane. Set your data free on its journey while still maintaining visibility, traceability and control.

Thursday, August 15, 2019

Data Lake vs Data Warehouse


Is a data lake part of your data warehouse platform or does the data lake sit beside it? There is a fair amount of ambiguity as to what a data lake is and how it should fit into your overall data strategy. 

I believe data lakes (coupled with elastic cloud storage and compute) are a game changer in both the DW and BI world. Your data warehousing strategy should be part of the data lake, not the other way around. While you don't have to throw away everything you have done or learned in your traditional ETL and DW world, the fundamentals have changed.

To take advantage of your data and build better BI/analytics, you must build atop a solid data lake foundation. And this goes well beyond the many failed Big Data and Hadoop projects of the recent past that many enterprises have experienced.

While Hadoop was a necessary step forward at the time, it was and is an evolutionary dead end - RIP Hadoop. Cloud data lakes are the future and it is more than putting your data into S3 buckets. 

Well architected data lakes are the culmination of a succinct data management strategy that leverages the strengths of cloud services and many traditional DW best practices and data governance policies.

Wednesday, May 15, 2019

R.I.P. HDFS | The Cloud Wins!


HDFS is an evolutionary dead end in the tree of big data. Data lakes based on S3 object storage deliver on the promise of separating storage from compute and make it possible to scale your processing and downstream analytics/AI and data marts on top of a data lake in an agile and elastic fashion. The HDFS architecture always bugged me when it was first released (besides the fact that it is written in Java). Moving the code to the Hadoop data nodes (with usually only three replicas available, by the way) seemed inherently limiting to me. It was not really better than using big Unix SMP servers, other than you got to use cheaper commodity hardware and grow incrementally. Good stuff, but not good enough - one step forward and a half step backwards.
While the idea of moving code to the data sounded cool at the time, it is fundamentally a bad data processing design for a truly scalable data lake that allows for spinning up an arbitrary number of ephemeral compute clusters on top of your storage. There is a place for HDFS and traditional Hadoop clusters if you have a big, fixed, slowly evolving and predictable compute/storage environment. For the rest of us, a cloud-based data lake architecture will win in the end and allow for agile development to meet the fast-paced needs of today's downstream BI, analytics and AI/ML applications that need to sit on top of the mythical data lake.

Thursday, August 9, 2018

Choosing Between Spark ML, scikit-learn, and DNNs


Now, these aren't the only considerations when deciding how to build your data science stack and the related tooling you will need around it, but it is where a lot of organizations tend to begin their opening questions. Sometimes the answer may be all of the above. But you first have to reflect on your organization's goals and the level of your investment in any transformation effort, especially one that involves such a fundamental shift in how you turn data into business value.

There are a number of considerations that can influence your data science architecture and that should be examined before establishing your AI platform. They include:

  1. ETL and data prep tools? AI does not work without data. Find it, mine for it and create it.
  2. Cloud, on-prem or hybrid for building your data science stack?
  3. How big is your data? Really how big is your data? Not everyone has "big data".
  4. What are you modeling? What kind of outcomes are you looking to solve for?
  5. Build, buy, partner. What kind of skills do you want to invest in for in-house data science, ML engineering and ML operations?
The bullets above only touch on much deeper considerations that need to be assessed by any organization looking to transform their business with AI. But let's step back a bit and just discuss the question posed by the title of this blog, to avoid turning it into a long drawn-out analysis that goes down too many rabbit holes.

Spark ML
It is natural for a lot of organizations who have been doing "Big Data" to get their first exposure to data science through Spark's MLlib. Spark ML is a nice module/framework that comes with Spark and is packaged with most major Hadoop distributions. The ML APIs and algorithms include many of the popular model-building options, from decision trees, to survival analysis (time-to-live), to recommendation engines (ALS), to unsupervised learning with clustering and topic modeling. Spark ML is nice and convenient for those coming from the Big Data universe. One nice advantage is that you can often leverage Spark's inherent distributed architecture to build models that can operate at large petabyte scale when needed. Is Spark ML ideal for all data sources and outcome objectives, and is it the most efficient (you can hack DNNs into it if you have the stomach for it)? The answer, as you might guess, is obviously no. Why? Well, that is for another day to dive into, but suffice it to say that it may not always be the most accurate way to build models and may not always be the best bang for your CPU/GPU buck.
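
For a flavor of the Spark ML developer experience, here is a hedged sketch of training an ALS recommender with pyspark; the ratings path and column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-demo").getOrCreate()
# Hypothetical ratings table with userId, itemId, rating columns.
ratings = spark.read.parquet("s3://my-lake/curated/ratings/")

als = ALS(
    userCol="userId", itemCol="itemId", ratingCol="rating",
    rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop",
)
model = als.fit(ratings)

# Top 5 recommendations per user, computed across the cluster.
recs = model.recommendForAllUsers(5)
recs.show(truncate=False)
```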

scikit-learn
Then there is good old scikit-learn. Any Python developer with a math or data background, or who has done any statistical modeling (or ML work), will know and likely love scikit-learn and all the other related Python packages such as numpy, scipy and pandas, to name the most popular. scikit-learn is a treasure trove of algos and APIs. It is an awesome framework for ML developers and data scientists. Does it scale the same way Spark can? Unfortunately, no. But do you always really need it to? Look at your data before you answer that.
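
For comparison, a minimal scikit-learn sketch: a pipeline that scales features, fits a classifier and cross-validates, all in memory on a single machine (using a bundled toy dataset).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Scale features, then fit a logistic regression classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```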

Deep Neural Nets
Then there are the new kids on the block, DNNs (back from the future). TensorFlow and PyTorch, just to name a couple of the most popular, are claimed to be universal function approximators that can model anything and solve for everything. Note, you will need to bring data, and lots of it - they are data hungry. They can solve anything from classification, to generating word embeddings, to creating generative models. There isn't much a DNN and its offshoots can't do, theoretically. Through their natural fit with GPUs they can scale fairly efficiently, and you can sometimes, sort of, distribute them with some extra heavy lifting.
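
And a tiny PyTorch sketch of the DNN route: a feed-forward network trained as a generic function approximator on synthetic data; the layer sizes and data are arbitrary for illustration.

```python
import torch
import torch.nn as nn

# Small feed-forward classifier: 20 input features -> 2 classes.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(1024, 20)            # synthetic features
y = torch.randint(0, 2, (1024,))     # synthetic labels

for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(loss.item())
```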

Taking your Models Live
A lot of what we just reviewed is about building and training models. Now, how do you take what we just trained and turn it into a service that predicts, classifies or generates data? That is also a topic unto its own. Operationalizing machine learning models can be non-trivial, but it can also be not so difficult at times. It just depends on the model you are creating. For example, sometimes discrete bounded models can just be exported into a database, but often the solution (input and output space) is not finite and requires deploying your models as distributed inference engines - and that is a bit more work. Then there is the nagging issue of how and when to update your models. Again, another subject altogether.
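
One simple way this can look in practice, sketched under the assumption of a scikit-learn style model persisted with joblib and served behind a small Flask endpoint; real deployments add input validation, batching, monitoring and model versioning.

```python
import joblib
from flask import Flask, request, jsonify

# Load a previously trained model (e.g. the scikit-learn pipeline above).
model = joblib.load("model.joblib")
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]     # expects a list of feature rows
    preds = model.predict(features).tolist()
    return jsonify({"predictions": preds})

if __name__ == "__main__":
    app.run(port=8080)
```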

Buy or Build
So should you build or buy? The big boys (google, aws, azure) are all making a lot of what we just described available as MLaaS offerings (to varying degrees of completeness). So stay tuned and stay current, as the AI technology world is changing fast.