Thursday, April 15, 2021

Data Driven vs Data Model Driven Company

Somehow along the way data lakes got the rap that you can dump "anything" into them. I think this is a carryover from the failed hippie free-data-love days of Hadoop and HDFS. No, a data lake is not a place where you dump any kind of JSON, text, XML, log data, etc. and just crawl it with some magic schema crawler, then rinse and repeat. Sure, you can take the approach of consuming raw sources and then crawling them to catalog the structure. But this is a narrow case that you do NOT handle in a thoughtless way. In many cases you don't need a crawler.

Now, with most data lakes you do want to consume data in raw form (ELT it, more or less), but this does not mean just dumping anything. You still must have expectations on structure and data schema contracts with the source systems you integrate with, including dealing with schema evolution and partition planning. Formats like Avro, Parquet and ORC are there to help you transform your data into normalized and ultimately well-curated (and DQ-ed) data models. Just because you have a "raw" zone in your data lake does not mean your entire data lake is a dumping ground for data of any type, or that your data source structures can just change at random.
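To make the idea of a schema contract a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the "orders" feed, its field names, the rule that a new contract version may only add fields, and the Hive-style year/month/day partition layout. A real pipeline would more likely lean on Avro or Parquet schemas and a schema registry, so treat this as a sketch of the principle, not a recommended implementation.

```python
from __future__ import annotations

from datetime import date

# Hypothetical contract for an "orders" feed agreed with the source system.
# Field names and types are illustrative only.
CONTRACT_V1 = {
    "order_id": int,
    "customer_id": int,
    "amount": float,
    "order_date": str,  # ISO date string, e.g. "2021-04-15"
}


def validate_record(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors


def is_compatible_evolution(old: dict, new: dict) -> bool:
    """Assumed evolution rule: a new version must keep every existing field
    with the same type. Adding new fields is allowed."""
    return all(new.get(field) is expected for field, expected in old.items())


def partition_path(record: dict, zone: str = "raw", dataset: str = "orders") -> str:
    """Build a Hive-style partition prefix from the record's order date."""
    d = date.fromisoformat(record["order_date"])
    return f"{zone}/{dataset}/year={d.year}/month={d.month:02d}/day={d.day:02d}"


if __name__ == "__main__":
    good = {"order_id": 1, "customer_id": 2, "amount": 9.5, "order_date": "2021-04-15"}
    bad = {"order_id": "1", "customer_id": 2, "amount": 9.5}

    print(validate_record(good, CONTRACT_V1))  # no violations
    print(validate_record(bad, CONTRACT_V1))   # wrong type and a missing field
    print(partition_path(good))

    # Adding a field is accepted; changing an existing field's type is not.
    print(is_compatible_evolution(CONTRACT_V1, {**CONTRACT_V1, "currency": str}))
    print(is_compatible_evolution(CONTRACT_V1, {**CONTRACT_V1, "amount": str}))
```

The point is that even the "raw" zone has a gate: records that break the contract get rejected or quarantined, and a source system cannot change the type of a field without the change being caught.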

Miracles required? This is what most of today's strategic AI and even BI/Analytics engineering and planning looks like. If you don't have your data modeled well and your data orchestration modularized and under control, then achieving the promise of cost-effective and maintainable ML models and self-service BI is a leap of faith at best. Forget about being a data-driven company if you are not yet a data-model-driven company.

A data lake is a modern DW built on highly scalable cloud storage and compute, and based on open data formats and open federated query engines. You can't escape the need for well-thought-out and curated data models. It does not matter whether you are using Parquet and S3 or Snowflake and Redshift. Data models are what make BI and Analytics function.


Thursday, January 21, 2021

The AI Lesson for All of Us


There is no doubt that the brute-force ML (aka deep learning) approach of pursuing general AI, or at least some level of human decision making, by using more and more compute and more data has been successful over the past decade.

I am fond of believing that there is more to AI than optimizing an objective function with more data and better hyperparameters - for example, integrating symbolic AI, knowledge graphs, causality, etc. However, trying to build systems that think the way we think we think may not be the future of AI, at least not yet.

There is likely something beyond just bigger deep learning models - maybe it is software program synthesis or other genetically founded approaches - no one knows, as there is not enough research in these areas yet. But some form of AI is already here: self-driving cars already use and construct 3D world models and combine hand-crafted rules with deep learning sensor data analysis to give us the perception that AI decision making is going on. Efficiency also matters as we get into bigger and bigger models with billions of parameters. It is no joke how much energy (compute resources) is required to train many of these models (e.g. GPT-3). It is also important to separate the hype from the value (companies selling us on autonomous cars vs. the value of some useful ML driver assistance). Companies use the AI hype to raise more capital, but the reality is not aligned with the capabilities of generalized AI, at least in this current age of AI.

ML algorithms from the likes of YouTube and Facebook already manipulate our digital lives and behaviors with the massive data they collect about us. Maybe AI is already here and in control and we are just the data simulation to generate more data for our AI overlords :) Anyway, my main point in sharing the post from Sutton (The Bitter Lesson) is to make us think about the data we control in the business and enterprise world. Curating our data, and more of it, is what will continue to drive ML and AI for the foreseeable future. So make sure to get your data quality and your data lakehouse BI/analytics in order ;)