Will Data Go Cloud Native?
The tools and platforms that data professionals use are increasingly running on cloud native technology
Photo by Jonny Gios on Unsplash
Data tools remain an extremely active space, which is very exciting for me as a lazy user. Don't take my word for it; you can read Matt Turck's excellent post: Resilience and Vibrancy: The 2020 Data & AI Landscape. One thing I wonder, though, is whether cloud native data tools are going to become dominant.
What is cloud native?
For a variety of reasons, people have come to use the term "cloud native" to mean two very different things.
"Cloud native" means using "containers, service meshes, microservices, immutable infrastructure, and declarative APIs", as described by the CNCF.
"Cloud native" means using the cloud service provider "native" tools. In this usage we're talking about provider-specific tools with tight integration, for example with AWS Control Tower, Config, or ParallelCluster.
It's unfortunate that there are two almost opposite usages of the term, but I can't do anything about that. (I actually think that you could end up implementing both meanings of the term in some instances, maybe somewhere in Serverless Land, so they aren't quite opposites.) In any case, for this article, I'm going with the first usage.
Cloud Native Data Tools
Let's get real: data tools run on cloud native.
Spark now targets deployment on Kubernetes. This by itself is tremendous. In addition, AWS recently rolled out a feature for its managed Spark service, EMR, that lets you run it on its managed Kubernetes service, EKS. Would I run production Spark workloads on Kubernetes? Not without a ton of testing and research, which I have not done at this time.
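For a sense of what "targeting Kubernetes" looks like in practice, here is a minimal sketch that assembles a spark-submit command pointed at a Kubernetes API server. The helper name spark_on_k8s_command, the API server URL, the image name, and the jar path are all placeholders I made up for illustration. The flags themselves (a k8s:// master URL, cluster deploy mode, and the spark.kubernetes.container.image setting) are the ones Spark documents for its Kubernetes scheduler. Treat it as a starting point for reading the docs, not a production recipe.

```python
import shlex


def spark_on_k8s_command(api_server, image, app_jar, main_class, executors=2):
    """Build a spark-submit argument list that targets a Kubernetes cluster.

    This helper is hypothetical; it only assembles arguments and does not
    contact a cluster or launch anything.
    """
    return [
        "spark-submit",
        # The k8s:// prefix selects Spark's Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod on the cluster.
        "--deploy-mode", "cluster",
        "--name", "spark-pi",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        # local:// means the jar is already present inside the image.
        f"local://{app_jar}",
    ]


if __name__ == "__main__":
    cmd = spark_on_k8s_command(
        api_server="https://kubernetes.example.com:6443",  # placeholder URL
        image="registry.example.com/spark:latest",  # placeholder image
        app_jar="/opt/spark/examples/jars/spark-examples.jar",  # placeholder path
        main_class="org.apache.spark.examples.SparkPi",
    )
    # Print a shell-safe version of the command for copy and paste.
    print(shlex.join(cmd))
```

Notice how little of this is Spark-specific: most of the work is pointing at a cluster and naming a container image, which is part of why so many data tools can make the same move.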
JupyterHub, RStudio, and Kubeflow are all examples of data infrastructure-as-software using Kubernetes. These are honorable, value-add tools that offer a consistent user experience to data scientists, using the tools that they are already familiar with (Kubeflow adds some new things, but also embeds Jupyter). This space is far from settled: AWS SageMaker, Azure ML Studio, and Google AI Platform are all strong offerings, which sometimes lightly overlap with cloud native. The "open core" up-and-comer, Databricks, is driving hard to IPO. Now that Spark runs on Kubernetes, will Databricks develop a Helm chart?
Airflow, the popular data pipeline tool which recently hit 2.0, runs on Kubernetes. Kafka, the big honking streaming data platform, is simplifying its architecture, and the Strimzi project is simplifying Kafka deployment on Kubernetes. (Strimzi is currently a CNCF Sandbox project, meaning that it may be a while longer before you can kiss your dedicated production Kafka clusters goodbye.) People are trying to put production databases on Kubernetes (and I have questions). ESRI, the winner of the geospatial data market, will add support for Kubernetes to its flagship product ArcGIS. Feature stores, a kind of multi-workflow database for MLOps, run on Kubernetes.
These technologies won't all melt into Kubernetes overnight, and cloud native hasn't won by any stretch, but there is clearly some momentum.
Don't Panic
Despite its complexity, I tend to think that Kubernetes sprawling into data infrastructure is just a function of Kubernetes sprawling into everything. It is moving into Edge. It is moving into IaaS. You can run blockchain on it apparently. Banks are using it.
What you definitely shouldn't do, unless you hate money, is run out and buy every data scientist on your team a copy of the Cloud Native Infrastructure book and a Kubernetes bootcamp. If you're a data professional reading this and thinking about "learning Kubernetes", that's a perfectly fine thing to do. It can be good to diversify. Just be aware of what you're getting yourself into.