Friday, January 24, 2020

Why Spark is the Wrong Abstraction


Is the sun setting on Spark? I don't want to knock Spark and frameworks like it; they have had their moment in the sun. Spark was a reasonable and important successor to Map/Reduce and HDFS/Hadoop, but its time has come to be exiled to the fringes of the big data ecosystem and used only when absolutely necessary. Spark is still useful for some specialized ETL and data processing applications, but overall it can be massive overkill and a burden to program and operate (expensive too). In many cases it is inefficient in both development and troubleshooting, and the overhead of infrastructure management can be costly relative to other options.

Not Everything is a Nail

I see many projects using Spark (and tools related to it, such as AWS EMR) for transforming and moving data into and out of data lakes when simpler tools would often do. Spark is a big and complex sledgehammer, and many of these problems can be solved with more effective tooling. The growing ubiquity of serverless technology, especially database and analytics services, has given engineers many more options, and it is time to dial back when it is appropriate to bring out the Spark sledgehammer.

In a lot of cases, Spark development amounts to writing a one-off database engine for your specific ETL scenario, and Spark's distributed compute and storage abstractions and DAG engine make that convenient. While it is possible to use Spark as a database engine of sorts, the reality is that databases are better at optimization and at using the available compute and storage resources. This is especially true of serverless technologies that support SQL as a first-class citizen; for Spark, SQL is really bolted on.
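To make the contrast concrete, here is a minimal sketch of the same aggregation written both ways. The dataset, column names, and S3 paths are hypothetical placeholders, not anything from a real project:

```python
# A minimal sketch: the same aggregation, imperative vs. declarative.
# The events dataset, column names, and S3 paths are made-up placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

# Imperative route: you assemble the engine yourself -- session setup,
# read, transform steps, and write are all explicit developer decisions,
# and cluster sizing/tuning is on you as well.
spark = SparkSession.builder.appName("purchases-per-user").getOrCreate()

events = spark.read.parquet("s3://my-bucket/events/")

purchases_per_user = (
    events
    .filter(col("event_type") == "purchase")
    .groupBy("user_id")
    .agg(count("*").alias("purchases"))
)

purchases_per_user.write.mode("overwrite").parquet(
    "s3://my-bucket/output/purchases_per_user/"
)

# Declarative route: the same result as a single SQL statement that a
# serverless engine (Athena/Presto, BigQuery) can plan and optimize
# without the developer managing a cluster at all.
DECLARATIVE_EQUIVALENT = """
SELECT user_id, COUNT(*) AS purchases
FROM events
WHERE event_type = 'purchase'
GROUP BY user_id
"""
```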

The one big challenge is that most database platforms are not designed for the cloud or for elastic compute and storage the way Spark sort of is. I say sort of, because Spark leaves too much responsibility on developers and DevOps to make data/compute optimization and infrastructure decisions, which is something databases are intrinsically good at.

Declarative vs Imperative Analytics

Now, there are serverless Spark offerings as well (AWS Glue and other managed Spark services), but given the general-purpose nature of Spark, optimization and resource allocation remain a challenge for developers. What are the alternatives? I really like Presto, and in particular its serverless AWS flavor (Athena), as well as services like BigQuery. Tools like these are the future of big data ETL and analytics. Spark can still be useful for heavy data transformation and complex feature engineering, but not as a general data analytics and data movement engine. Streaming is one specialized area where solutions like Spark can still play, but there are many other solutions better designed out of the box for streaming and cloud scale-out. Spark has in many respects tried to be all things to all people: it has continuously expanded its support for SQL semantics, incorporated APIs for streaming, and so on. This has made Spark a versatile framework and API for developers, but as a general-purpose ETL and data analytics engine, I think there are now other options.
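For a sense of how little there is to operate on the serverless path, here is a rough sketch of submitting a query to Athena with boto3 and polling for the result. The database name, query, and S3 output location are assumptions for illustration only:

```python
# A rough sketch of the serverless alternative: hand SQL to Athena and
# wait for the answer, with no cluster to size, tune, or keep running.
# The database name, query, and S3 output location are placeholders.

import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT event_type, COUNT(*) AS event_count
FROM events
GROUP BY event_type
"""

submission = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "analytics_lake"},  # placeholder
    ResultConfiguration={
        "OutputLocation": "s3://my-bucket/athena-results/"  # placeholder
    },
)
query_id = submission["QueryExecutionId"]

# Poll until Athena has planned and run the query for us.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```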

While the sun may not completely set on Spark, and tools like it, the declarative power of SQL will win in the end over the imperative programming model of Spark. This has been proven time and time again in the database and analytics tech space.

Tuesday, January 21, 2020

Data Lakes before AI/ML/Analytics (cart before horse thing)


Don't start or continue your AI and predictive analytics journey without building the necessary data infrastructure underpinnings. That starts, first and foremost, with a cloud data lake designed to meet the data- and compute-hungry needs of your AI/ML and analytics workloads. Why build a cloud data lake first?

1) Economics
2) Elastic compute
3) Elastic Storage
4) Storing (almost) everything
5) ML Model engineering
6) Feeding downstream analytics
7) Feeding downstream operational data stores
8) Data exploration, experimentation and discovery

A cloud data lake makes all the above possible at scale.
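For a sense of how little ceremony this takes once the lake is in place, here is a hedged sketch of exposing raw lake files to downstream SQL engines, ML feature pipelines, and ad-hoc exploration. The bucket, database, table, and column names are made up for illustration:

```python
# A sketch of "store (almost) everything, query it in place": raw files
# land in object storage partitioned by date, and one DDL statement makes
# them queryable downstream. All names and paths here are placeholders.

# Hypothetical lake layout:
#   s3://my-data-lake/raw/events/dt=2020-01-21/part-0000.parquet
#   s3://my-data-lake/raw/events/dt=2020-01-22/part-0001.parquet

CREATE_EVENTS_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics_lake.events (
    user_id     string,
    event_type  string,
    ts          timestamp
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/raw/events/'
"""

# Once registered (e.g. via an Athena client or the Glue catalog), the same
# files can feed ML feature engineering, downstream analytics, operational
# stores, and exploration without copying data or standing up a cluster.
```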

Building a cloud data lake securely and in an architecturally effective manner is achievable, and it will make your downstream AI/ML/Analytics journey attainable and sustainable over the long term. Don't start your journey without this foundation.