Innovation in enterprise data management and understanding what a data lakehouse is

Let’s be honest. If you were asked to pick a boring-sounding topic from a list of new technologies, there’s no doubt that something dubbed “enterprise data management” would be one of your top choices. After all, it doesn’t exactly scream sexy or exciting. However, it turns out that the ability to garner meaningful business insights from a host of different data sources in a timely and secure manner is important for organizations of all sizes.

Toss in the fact that AI-powered analytics can be leveraged to generate the information, and that a pre-configured cloud-based offering can automatically take care of the messy, challenging, behind-the-scenes prep work necessary to get those insights, and things start to get more interesting.

Cloudera is a software company dedicated to providing enterprise data management systems. It started as an open-source software company built primarily around the Apache Hadoop big data analytics tools and merged a few years back with Hortonworks, another Hadoop-focused company.

Generally seen as a leader in large-scale data management applications, Cloudera continues to make important contributions to the open-source community and has been at the forefront of efforts to create a completely open data lakehouse platform, the hottest trend in big data.

The company also just announced a new SaaS solution, CDP One, that is supposed to offer all of these capabilities. More importantly, because of how it’s built, it should open up the Cloudera Data Platform (CDP) to a wider range of companies and a broader group of individuals within those organizations.

For those who may not know what a data lakehouse is, think of it as a combination of a data lake, which is primarily used with unstructured and semi-structured data, such as text, audio, video and images, and a data warehouse, which is most commonly used with traditional, table-based structured data of numbers, values, etc.

A data lakehouse essentially combines the best of these two worlds by bringing the kinds of structured queries that have traditionally been offered only with data warehouses to the unstructured data in data lakes. In addition, it lets organizations do analysis across the two data types simultaneously, which turns out to be incredibly useful for machine learning and other advanced AI-based applications.
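To make the idea concrete, here is a minimal sketch of what querying across both kinds of data can look like. It is a toy illustration only and not how Cloudera's platform is implemented: SQLite stands in for the query engine, and the customer table, the JSON support tickets, and the function names are all hypothetical.

```python
import json
import sqlite3

# Hypothetical "data lake" content: semi-structured JSON support tickets.
RAW_TICKETS = [
    '{"customer_id": 1, "text": "App crashes on login", "tags": ["bug", "login"]}',
    '{"customer_id": 2, "text": "Love the new dashboard"}',
    '{"customer_id": 1, "text": "Still crashing after update", "tags": ["bug"]}',
]

# Hypothetical "data warehouse" content: a structured customer table.
CUSTOMERS = [(1, "Acme Corp", "enterprise"), (2, "Bolt LLC", "startup")]


def build_lakehouse():
    """Load both data sets into one in-memory database (assumed stand-in engine)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, tier TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", CUSTOMERS)
    conn.execute("CREATE TABLE tickets (customer_id INTEGER, text TEXT, is_bug INTEGER)")
    for line in RAW_TICKETS:
        record = json.loads(line)
        # Tickets without a "tags" field are treated as non-bug reports.
        is_bug = int("bug" in record.get("tags", []))
        conn.execute(
            "INSERT INTO tickets VALUES (?, ?, ?)",
            (record["customer_id"], record["text"], is_bug),
        )
    return conn


def bug_reports_by_customer(conn):
    """Run one structured query that spans the table data and the ticket data."""
    query = """
        SELECT c.name, c.tier, SUM(t.is_bug) AS bug_reports
        FROM customers AS c
        JOIN tickets AS t ON t.customer_id = c.id
        GROUP BY c.id
        ORDER BY c.id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in bug_reports_by_customer(build_lakehouse()):
        print(row)
```

The point of the sketch is the single SQL statement at the end: one query answers a question that needs both the structured customer records and the loosely structured ticket text.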

As great as this sounds in theory, however, the truth is that it’s very difficult to do. In fact, pulling meaningful business insights from this diverse set of data is a task that has typically been limited to the rarefied world of data scientists and the specialized skill sets they possess. These individuals are in great demand right now, making them difficult for many companies to find and very expensive to recruit and retain. In addition, the tools necessary to do this work, such as the existing Cloudera Data Platform, are very powerful but not for the technically faint of heart.

Practically speaking, that means that while organizations now have access to more, larger, and potentially more interesting data sets than ever before, and the tools to fully leverage this data have grown increasingly capable, only the largest, most technically sophisticated companies have been able to take advantage of this powerful combination. More companies, and the market in general, need something that can bring these types of advanced data management and analytics tools to a larger audience, hence the launch of CDP One. It’s Cloudera’s effort to bring the kinds of capabilities and data management tools from its current CDP Private Cloud on-premises and CDP Public Cloud offerings to a more mainstream audience.

Part of the problem is that this isn’t an easy thing to do. Enterprise data management has remained an obscure topic for many because of how much work and expertise are necessary for these types of projects. For one, you have to get access to and import or “ingest” the various data sets you want to work with. As with many aspects of big data, the data ingest process is something that sounds straightforward in theory but turns out to be challenging in practice.

For example, because data can come from any combination of public cloud resources, on-premises databases, SaaS application outputs, real-time streaming inputs and more, it can be challenging to bring together all the elements that organizations want to analyze. In addition, it turns out that the format of the tables in which some types of data are stored is proprietary, bringing further hassles to the ingest process. To help with that, Cloudera recently added support for the open-source Apache Iceberg table format to CDP, yet another example of the company’s effort to support open standards.
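A small sketch shows why ingest gets messy even in a trivial case. Everything here is assumed for illustration: the two sources (a CSV export and a JSON-lines stream), their field names, and the target record shape are hypothetical, and real pipelines add authentication, schema evolution, and table formats on top of this.

```python
import csv
import io
import json

# Hypothetical exports from two sources, inlined so the sketch is self-contained.
CRM_CSV = "order_id,amount_usd,region\n1001,250.00,EMEA\n1002,99.50,APAC\n"
STREAM_JSONL = '{"id": 2001, "total": {"value": 75.25, "currency": "USD"}, "geo": "AMER"}\n'


def from_csv(text):
    """Yield normalized records from the CSV export (all values arrive as strings)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "order_id": int(row["order_id"]),
            "amount_usd": float(row["amount_usd"]),
            "region": row["region"],
            "source": "crm_csv",
        }


def from_jsonl(text):
    """Yield normalized records from the stream (different names, nested amount)."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        yield {
            "order_id": record["id"],
            "amount_usd": float(record["total"]["value"]),
            "region": record["geo"],
            "source": "stream_jsonl",
        }


def ingest():
    """Combine both sources into one list of records sharing a single schema."""
    return list(from_csv(CRM_CSV)) + list(from_jsonl(STREAM_JSONL))


if __name__ == "__main__":
    for record in ingest():
        print(record)
```

Each source needs its own adapter because the field names, types, and nesting differ, and that per-source work is what multiplies as more inputs are added.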

Additionally, data often needs to be prepped and/or modified to make it ready for manipulation and analysis. In order to do that, various cloud-based computing, storage, and networking resources may need to be configured to handle this work. Plus, ML or AI models may need to be loaded or adjusted to begin the analysis work. Finally, overarching all of this is the need to ensure that no data is accidentally released and no security holes are created in the process of configuring and enabling all these resources. Respectively known as DevOps, MLOps, and SecOps, these three critical sets of operational functions can be some of the most time- and resource-consuming parts of a big data analysis project. In response to this challenge, one of the key benefits of CDP One is what Cloudera calls Zero Ops, meaning the service takes care of all that work itself, making the move to the critical data analysis part of the process much easier and faster.

The data analysis tools themselves can be a bit daunting for all but the most technically advanced data scientists, developers, or business intelligence analysts. Cloudera is thus responding to the growing interest in low-code and no-code tools for analysis and visualization. The goal is to give sophisticated business users, not only data specialists, the ability to incorporate the cloud-based data management and analysis tools from CDP into their regular workflow.

In truth, we’ve been talking about the benefits of big data analytics for what seems like a decade or more now. What has become apparent over the ensuing years is that achieving useful results from these efforts is a lot harder than most realized (and than most companies and tech vendors are willing to admit). With CDP One, Cloudera looks to be making solid strides towards closing this gap. It’s also bringing potentially exciting opportunities for leveraging important insights from large data sets to a much wider audience.