Paxata Employs Kubernetes to Extend Data Prep Tool

Paxata has extended the reach of its data preparation software for big data into the realm of the cloud by employing Kubernetes clusters to create a runtime to process jobs in batch mode.

Piet Loubser, senior vice president and global head of marketing, says that by extending the data prep engine Paxata built on top of the Apache Spark in-memory computing framework, the company is making it possible for organizations to run batch-oriented data prep jobs using cloud services that can support a runtime based on Kubernetes.

That runtime extension is the first in a series of Kubernetes initiatives that could lead to a shift away from a subscription licensing model to one based on actual consumption of Paxata software, Loubser adds.

Paxata also has added support for other orchestration frameworks, including Microsoft Azure HDInsight and Apache Hadoop YARN, but Loubser says Kubernetes eventually will become the more strategic platform because it will eliminate the need for organizations to deploy a separate cluster to run the Apache Spark framework. The goal, he says, is to reduce the number of cluster types an IT organization needs to master.

Loubser notes the Paxata approach to data prep is unique because it now also includes an Adaptive Workload Management feature that enables organizations to define their own interactive data volumes rather than limiting them to a fixed amount of data that requires a business analyst to sample continually.

Data prep tools have taken on added importance because business analysts rely on them to identify errors and anomalies in data sets before they are exposed to analytics applications. However, for many organizations, the size of those data sets exceeds anything they can effectively process on-premises. Support for Kubernetes should allow Paxata customers to burst workloads into a public cloud, where they would need to run only for the limited time it takes to accomplish a specific task whenever required.

To achieve that goal, many organizations are adopting more formal DataOps processes that borrow many of the agile concepts pioneered by organizations that embraced DevOps to create data pipelines. In the future, those data pipelines will be employed to drive any number of models built using machine and deep learning algorithms that require access to massive amounts of data. There may even come a day when those DataOps and DevOps processes start to meld seamlessly into one another as various types of AI engines become pervasively embedded in almost every application imaginable.

At the same time, Kubernetes is quickly emerging as a key enabling technology driving the emergence of true hybrid cloud computing. IT organizations want to be able to employ multiple clouds to run a variety of Big Data applications. But that doesn’t necessarily mean IT organizations want to master multiple data prep tools for each platform. Kubernetes essentially provides a common layer of abstraction that enables Paxata to create a runtime that can be distributed almost anywhere.

Mike Vizard

Mike Vizard is a seasoned IT journalist with over 25 years of experience. He also contributed to IT Business Edge, Channel Insider, Baseline and a variety of other IT titles. Previously, Vizard was the editorial director for Ziff-Davis Enterprise as well as Editor-in-Chief for CRN and InfoWorld.
