Google Aims to Wed Apache Spark to Kubernetes

September 10, 2019

by Mike Vizard

Google today announced it will extend Google Cloud Dataproc, a managed service for accessing the Apache Spark in-memory computing framework, to Kubernetes.

James Malone, a Google senior product manager, says Google will now make the managed Apache Spark service available on top of Kubernetes to make it easier for clusters running the Spark framework to scale up and down as needed. It is available in alpha today on Google Kubernetes Engine (GKE).

Google also plans to deploy other open source processing engines, such as Apache Flink, on top of Kubernetes, Malone says.

Google intends to continue offering the existing Google Cloud Dataproc service based on YARN (Yet Another Resource Negotiator), Malone notes. However, he expects that over time organizations will prefer to standardize on Kubernetes to manage resources across multiple processing engines.

Malone says the primary goal is to provide data professionals with a more agile approach that employs containers to eliminate dependencies on specific versions and libraries, which in turn makes it possible to move data pipelines between clusters. GKE, meanwhile, eliminates the need for data scientists to concern themselves with sizing and building clusters or manipulating Dockerfiles and Kubernetes networking configurations.


Google has also already isolated Google Cloud Dataproc from the Hadoop Distributed File System (HDFS), which means instances of Apache Spark running on Kubernetes in Google Cloud natively access cloud storage resources or the BigQuery data warehouse, he adds.

Malone says Google is trying to lower the bar to accessing big data analytics by combining two open source technologies led by different open source organizations. The Apache Software Foundation (ASF) oversees the development of Spark, while Kubernetes is developed under the auspices of the Cloud Native Computing Foundation (CNCF). Google, he says, views itself as being in a unique position to meld the two platforms into a single service.

It’s not clear just yet what the future holds for big data analytics platforms. In the wake of some high-profile acquisitions among providers of Hadoop platforms, there’s a lot more discussion about hosting big data analytics platforms directly on cloud storage services. That approach eliminates the need to rely on HDFS to manage big data. However, a significant number of Hadoop instances still run in on-premises environments that rely on HDFS to manage massive amounts of data.

However, Malone says IT organizations should also expect Google to make a case for using its Anthos cloud platform, which is based on Kubernetes, as a means to avoid being locked into any one platform.

In the meantime, IT organizations should expect to see a lot more melding of big data platforms and Kubernetes. Many of these platforms developed their own substrates for accessing and managing resources before Kubernetes became a de facto standard. The challenge and the opportunity now, of course, will be migrating some of the very large existing instances of big data platforms onto Kubernetes in the months and years ahead.