Friday, December 31, 2021

Machine Learning is Not AI

We need to stop referring to today's machine learning as AI. It is marketing techno-spin, no more and no less. There is nothing intelligent about it. We are nowhere near general or even narrow artificial intelligence.

Deep Learning has Come Far

Machine learning, largely driven by deep learning, has fueled an amazing explosion of solutions, from image recognition to game play to language parsing and analysis; however, none of this approximates human intelligence. As with other engineered systems, we need deeper structures and abstractions to model the world with, not just more parameters, more layers and more experimentation with activation functions.

More is Not the Answer

Just adding more neurons, more deep layers and experimenting with activation functions and hyper-parameters, or understanding bias vs. variance, is not going to get us anything like intelligence. These concepts are all useful for better machine learning and for preparing data to achieve the best predictions possible given the data available, but they are not AI. As an engineer I find it amusing to sit there and fiddle with an LSTM to get it to predict temporal events given a set of historical events and independent variables. You definitely get statistical insights into the problem you are modeling (if your data is in good shape, that is), but again this is not intelligence.
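To make that concrete, here is roughly what the exercise looks like as a minimal PyTorch sketch: an LSTM fit to windows of historical values plus a few independent variables. The data, shapes and hyper-parameters are synthetic and illustrative, not a recipe. It is useful statistical machinery, nothing more.

```python
import torch
import torch.nn as nn

class TemporalLSTM(nn.Module):
    """Map a window of past observations (plus covariates) to the next value."""
    def __init__(self, n_features: int, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # predict a single next value

    def forward(self, x):                        # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])          # last hidden state -> prediction

# Synthetic stand-in for "historical events + independent variables".
X = torch.randn(256, 24, 4)   # 256 windows, 24 time steps, 4 features
y = torch.randn(256, 1)

model = TemporalLSTM(n_features=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(10):        # the "fiddling" loop: tweak, refit, repeat
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```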

GPT-X and transformers are an example of how not to do intelligence, with billions of parameters and massive power consumption. There have been some cool results and solutions that have come out of all this, but it is brainless black-box pattern recognition and association - nothing wrong with that - but don't call it AI. Without incorporating higher-order abstractions, including causality and relationships, I don't see how today's ML and DL can be called AI.

Friday, November 26, 2021

Modern Cloud Data Lake/Warehouse: Don't get Locked-in All Over Again

When relational databases, data warehouses, and data marts took root in the late 1990s, our data and our database systems were just down the hall, in a rack sitting in a server room. Our data resided on our own property and in server rooms we controlled. While we had physical control over our data, we were tightly bound to the database vendor's proprietary software/hardware systems, storage formats and SQL dialect. Everything from SQL dialects to storage formats was at the time much less standardized than what we have today.

Proprietary Storage Engines

We were entrusting our data storage and query engine interfaces to a software vendor (at the time Oracle, Sybase, IBM Informix, SQL Server...etc). Our data was in a proprietary storage format controlled by the database software vendor, and we were at their mercy for future support and licensing to keep access to our data.

Colocation and Hosting

As the internet grew, along came specialized hosting data centers. We moved our OLTP databases, OLAP data warehouses and our app servers to secured cages sitting in a remote, climate-controlled and network-optimized shared complex. We owned the hardware, but the data was still kept in proprietary storage formats by the big database vendors like Oracle, SQL Server, Sybase...etc. With this transition our data moved a bit further from our control, since we were giving up some physical control of and access to the infrastructure for the benefits of colocation.

Moving to the Public Cloud

Then came the public cloud and infrastructure as a service. We moved our database systems onto virtual hardware, with managed storage and networking controlled by a cloud provider. We no longer owned or controlled the physical infrastructure or managed a physical space in a data center. This brought many benefits: easier provisioning of infrastructure, remote/automated management, and the virtually unlimited incremental scalability of cloud compute and storage. However, our data is still locked up in proprietary storage engines. Either we are running our own Oracle or Teradata software licenses on virtual machines in the cloud, or we are using more cloud-native data warehouse services such as Redshift or BigQuery.

Why does all this history matter? The less control we have over our data systems, the more restrictions we will face on future opportunities to use the raw data, metadata and related processing logic (e.g. SQL, DML, UDFs...etc), not to mention on managing costs and licensing. When your data is stored in a vendor's storage engine (Oracle, Teradata...etc), it is stored in their proprietary format and you are typically limited to using only their query engine and tooling to access it.

When your servers are in a remote site off your property, you rely on the data center for security and management. When you then host your servers in the public cloud, there are additional layers of software involved, from compute/storage virtualization to shared infrastructure services. These all add to the loss of control over your data and over your ability to access and utilize it without paying a service fee of some kind. This loss of control can mean a lack of options (in end-of-life scenarios, for example) that impacts portability, scalability and overall cost management.

The Rise of Cloud Data Warehouse Vendors 

Being dependent on a cloud provider and database software vendors (sometimes one and the same) for how you use your data needs to be front of mind in your cloud data warehouse architecture. While it is not practical to have 100% portability from cloud infrastructure providers or from your software vendors, you need to consider how best to leverage open source and open storage standards, and to keep the door open to hybrid cloud options (or cloud provider portability) whenever possible when it comes to your data platform. I am a firm believer in making these conscious decisions upfront; otherwise you will just be repeating the last two decades of Teradata-, Netezza- and Oracle-style lock-in that many enterprises are still trying to unwind today.

The lock-in scenarios with proprietary data warehousing storage engines have not changed much with the public cloud providers. Data warehouse engines such as Redshift and BigQuery still store the data in proprietary formats. They offer much greater data integration flexibility with other cloud services than legacy data warehouse vendors do, but you are still at the mercy of their proprietary storage.

Does Your Cloud DW Reside in Your Cloud Account?

There are now newer players in the cloud data warehousing space, with Snowflake leading and others coming online to provide data-as-a-service solutions. With these SaaS data warehousing providers, your data may reside in S3, but it is controlled by the SaaS vendor, such as Snowflake, in a different AWS account (or a different Azure account). This does not bring you any more control over your data. With SaaS data-as-a-service vendors you are still subject to their lock-in, and, even worse, your data does not reside in your own cloud account. It is one thing to have your data in the public cloud; it is another to have it in someone else's AWS account. This can be fine for some enterprises, but it needs to be clearly understood what kind of control (for better or worse) you are delegating to your database vendor with solutions such as Snowflake.



So what is the solution to this lock-in?

Big Data with open-source Hadoop attempted to address the proprietary lock-in problem. In the late 2000s, Hadoop took off and started to at least put some of your data into open standard data formats and on commodity hardware with less vendor lock-in (to a fair degree). While Hadoop had its challenges that I won't get into here, it did usher in a new era of Big Data thinking and a Cambrian-like explosion of open-source data technology and democratization. Data specs such as ORC, Avro and Parquet and distributed file systems such as HDFS gave transparency to your data and modularity to managing growth and costs. You no longer depended exclusively on proprietary data storage engines, query engines and storage formats. So with Hadoop, at the time, we could claim to have gained some degrees of freedom and improved control over our data and software.
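As a small illustration of that transparency, here is a sketch with pyarrow (file name and columns are illustrative): data written to Parquet by one tool is readable by any other engine that speaks the format, and the schema travels with the files instead of living inside a vendor's storage engine.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

table = pa.table({
    "order_id": [1, 2, 3],
    "amount":   [19.99, 5.50, 42.00],
    "country":  ["US", "DE", "US"],
})
pq.write_table(table, "orders.parquet")    # open, columnar, self-describing

# Read it back with a completely different library -- no vendor engine required.
df = pd.read_parquet("orders.parquet")
print(df.dtypes)
print(pq.read_schema("orders.parquet"))    # the schema ships with the data
```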

Now on-premises Hadoop is dying off (it is dead for the most part) and cloud storage engines and data lakes are taking over. Many of these cloud-native storage solutions and data lake storage engines have adopted the open data standards of the Hadoop era (Parquet, Avro, ORC, Snappy, Arrow...etc). These cloud-native data lake house products can keep you close to your data. Solutions such as Athena, Presto and managed Databricks let you manage your data in open data formats while storing it on highly elastic and scalable cloud object storage.

However, other cloud data warehousing vendors have emerged and are bringing back the lock-in, meaning your data resides outside your cloud account, in proprietary storage, behind proprietary query engines. Vendors such as Redshift, Snowflake, BigQuery and Firebolt each have pros and cons in the type and level of lock-in they impose.

It's All About the Data Lake

Many of these engines do offer decent integration with open standards. Redshift, Snowflake and BigQuery, for example, all allow fairly easy ingestion from and export to open data standards such as Parquet and ORC. Lock-in is not a bad thing if the solution rocks and is cost effective in the long term. Sometimes specialized proprietary compression and unique architectures do things that are not possible with today's open standards. You be the judge. Or just let your successor in four years deal with it :)
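One way to keep the exit door open is to periodically unload warehouse tables to Parquet in a bucket you own. The sketch below assumes a Redshift-style warehouse and uses Redshift's documented UNLOAD ... FORMAT AS PARQUET form; the connection details, bucket and IAM role are placeholders, and Snowflake and BigQuery have their own equivalent export paths.

```python
import psycopg2  # any Redshift-compatible Postgres driver will do

UNLOAD_SQL = """
UNLOAD ('SELECT * FROM analytics.orders')
TO 's3://your-data-lake/export/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-unload-role'
FORMAT AS PARQUET;
"""

conn = psycopg2.connect(host="your-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="exporter", password="...")
with conn, conn.cursor() as cur:
    cur.execute(UNLOAD_SQL)   # the data lands as open Parquet files you control
```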

The one bit of advice I would give when building a cloud data platform is to always base your architecture on a data lake house foundation using open data storage standards, elastic cloud storage and a distributed SQL query engine (a sketch of what that can look like follows below). Your choice of Redshift, Snowflake, BigQuery and other downstream storage engines and downstream analytics is critical, but secondary.
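A minimal sketch of that foundation (paths, columns and DDL are illustrative assumptions): land curated data as partitioned Parquet on object storage, then point whichever distributed SQL engine you prefer (Athena, Presto/Trino, Spark SQL) at the same files.

```python
import pyarrow as pa
import pyarrow.dataset as ds

events = pa.table({
    "event_date": ["2021-11-01", "2021-11-01", "2021-11-02"],
    "user_id":    [101, 102, 101],
    "event_type": ["view", "click", "view"],
})

# Partition by date so query engines only scan what they need.
ds.write_dataset(
    events, "lake/events", format="parquet",
    partitioning=ds.partitioning(pa.schema([("event_date", pa.string())]),
                                 flavor="hive"))

# The same files can then be registered in an Athena/Trino-style catalog
# with DDL along these lines (illustrative):
EXTERNAL_TABLE_DDL = """
CREATE EXTERNAL TABLE events (user_id BIGINT, event_type STRING)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3://your-data-lake/events/';
"""
```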

Thursday, April 15, 2021

Data Driven vs Data Model Driven Company

Somehow along the way data lakes got the rap that you can dump "anything" into them. I think this is carryover from the failed hippie free-data-love days of Hadoop and HDFS. No, a data lake is not a place where you dump any kind of JSON, text, XML, log data...etc, crawl it with some magic schema crawler, then rinse and repeat. Sure, you can take the approach of consuming raw sources and then crawling them to catalog their structure. But that is a narrow case that you do NOT do in a thoughtless way. In many cases you don't need a crawler.

Now, with most data lakes you do want to consume data in raw form (ELT it, more or less), but this does not mean just dumping anything. You still must have expectations on structure and data schema contracts with the source systems you integrate with, including plans for schema evolution and partitioning. Formats like Avro, Parquet and ORC are there to transform your data into normalized and ultimately well-curated (and DQ-ed) data models. Just because you have a "raw" zone in your data lake does not mean your entire data lake is a dumping ground for data of any type, or that your source data structures can just change at random.
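A minimal sketch of what such a schema contract can look like at the edge of the raw zone (field names and types are illustrative assumptions): validate incoming records against the expected schema before they land as Parquet, so source drift fails loudly instead of silently silting up the lake.

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

EXPECTED_SCHEMA = pa.schema([
    ("order_id",   pa.int64()),
    ("placed_at",  pa.string()),    # ISO timestamp as a string, cast downstream
    ("amount_usd", pa.float64()),
])

def land_raw_batch(json_lines, out_path):
    """Check a batch of raw JSON lines against the contract, then write Parquet."""
    records = [json.loads(line) for line in json_lines]
    required = set(EXPECTED_SCHEMA.names)
    for i, rec in enumerate(records):
        missing = required - rec.keys()
        if missing:
            raise ValueError(f"record {i} breaks the schema contract, missing: {missing}")
    # Type mismatches raise pyarrow errors here instead of quietly landing bad data.
    table = pa.Table.from_pylist(records, schema=EXPECTED_SCHEMA)
    pq.write_table(table, out_path)

land_raw_batch(
    ['{"order_id": 1, "placed_at": "2021-04-15T10:00:00Z", "amount_usd": 19.99}'],
    "orders_batch_0001.parquet")
```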

Miracles required? That is what most of today's strategic AI and even BI/analytics engineering and planning looks like. If you don't have your data modeled well and your data orchestration modularized and reined in, then achieving the promise of cost-effective and maintainable ML models and self-service BI is a leap of faith at best. Forget about being a data-driven company if you are not yet a data-model-driven company.

A data lake is a modern DW built on highly scalable cloud storage and compute and based on open data formats and open, federated query engines. You can't escape the need for well-thought-out and curated data models. It does not matter whether you are using Parquet and S3 or Snowflake and Redshift. Data models are what make BI and analytics function.


Thursday, January 21, 2021

The AI Lesson for All of Us


There is no doubt that the brute-force ML (aka deep learning) approach of chasing general AI, or at least some level of human decision making, with more and more compute and more data has been successful over the past decade.

I am fond of believing that there is more to AI than optimizing an objective function with more data and better hyper-parameters - for example, integrating symbolic AI, knowledge graphs, causality...etc. However, trying to build systems that think the way we think we think may not be the future of AI, at least not yet.

There is likely something beyond just bigger deep learning models - maybe it is software program synthesis or other genetically founded approaches - no one knows, as there is not enough research in these areas yet. But some form of AI is already here: self-driving cars already construct and use 3D world models, mixing hand-crafted rules with deep learning analysis of sensor data, to give us the perception that AI decision making is going on. Efficiency also matters as we get into bigger and bigger models with billions of parameters. It is no joke how much energy (compute resources) the training of many of these models (e.g. GPT-3) requires. It is important to separate the hype (companies selling us on autonomous cars vs. the value of some useful ML driver assistance) from reality; companies use the AI hype to raise more capital, but the reality is not aligned with the capabilities of generalized AI, at least in this current age of AI.

ML algorithms from the likes of YouTube and Facebook already manipulate our digital lives and behaviors with the massive data they collect about us. Maybe AI is already here and in control, and we are just the simulation generating more data for our AI overlords :) Anyway, my main point in sharing Sutton's post (The Bitter Lesson) is to make us think about the data we control in the business and enterprise world. Curating our data, and more of it, is what will continue to drive ML and AI for the foreseeable future. So make sure to get your data quality and your data lakehouse BI/analytics in order ;)