Apache Iceberg promises to change the economics of cloud-based data analytics

By 2015, Netflix had completed its move from an on-premises data warehouse and analytics stack to one based around AWS S3 object storage. But the environment soon began to hit some snags.

"Let me tell you a little bit about Hive tables and our love/hate relationship with them," said Ted Gooch, former database architect at the streaming service.

While there were some good things about Hive, there were also some performance-based issues and "some very surprising behaviors."

"Because it's not a heterogeneous format or a format that's well defined, different engines supported things in different ways," Gooch – now a software engineer at Stripe and an Iceberg committer – said in an online video posted by data lake company Dremio.

Out of these performance and usability challenges inherent in Apache Hive tables in large and demanding data lake environments, the Netflix data team developed a specification for Iceberg, a table format for slow-moving data or slow-evolving data, as Gooch put it. The project was developed at Netflix by Ryan Blue and Dan Weeks, now co-founders of Iceberg company Tabular, and was donated to the Apache Software Foundation as an open source project in November 2018.

Apache Iceberg is an open table format designed for large-scale analytical workloads while supporting query engines including Spark, Trino, Flink, Presto, Hive and Impala. The format promises to help organizations bring their analytics engine of choice to their data without going through the expense and inconvenience of moving it to a new data store. It has also won support from data warehouse and data lake big hitters including Google, Snowflake and Cloudera.

Cloud-based blob storage like AWS S3 does not have a way of showing the relationships between files or between a file and a table. As well as making life tough for query engines, it makes changing schemas and time travel difficult. Iceberg sits in the middle of what is a big and growing market. Data lakes alone were estimated to be worth $11.7 billion in 2021, forecast to grow to $61.07 billion by 2029.

"If you're looking at Iceberg from a data lake background, its features are impressive: queries can time travel, transactions are safe so queries never lie, partitioning (data layout) is automatic and can be updated, schema evolution is reliable – no more zombie data! – and a lot more," Blue explained in a blog.

But it also has implications for data warehouses, he said. "Iceberg was built on the assumption that there is no single query layer. Instead, many different processes all use the same underlying data and coordinate through the table format along with a very lightweight catalog. Iceberg enables direct data access needed by all of these use cases and, uniquely, does it without compromising the SQL behavior of data warehouses."
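As a rough illustration of the idea Blue describes, where many processes coordinate through table metadata and a lightweight catalog, the toy Python sketch below models a table as an append-only log of snapshots plus a single "current" pointer. It is not Iceberg's API or specification: `ToyTable`, `Snapshot`, `commit` and `scan` are names invented here, and a real catalog would perform the pointer swap atomically across machines.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: the full set of data files at one point in time."""
    snapshot_id: int
    data_files: Tuple[str, ...]


class ToyTable:
    """Hypothetical table: an append-only snapshot log plus a 'current' pointer."""

    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = [Snapshot(0, ())]
        # The only mutable state; stands in for the catalog entry.
        self.current_id = 0

    def commit(self, base_id: int, new_files: List[str]) -> int:
        """Add files on top of base_id; fail if another writer committed first."""
        if base_id != self.current_id:
            raise RuntimeError("conflict: table changed since base snapshot")
        base = self.snapshots[base_id]
        snap = Snapshot(len(self.snapshots), base.data_files + tuple(new_files))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id  # the pointer swap
        return snap.snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Read the current snapshot, or an older one ("time travel")."""
        wanted = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[wanted].data_files
```

Because old snapshots are never modified, a reader holding an earlier snapshot ID keeps seeing a consistent set of files, and a writer working from a stale snapshot is rejected rather than silently overwriting someone else's commit.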

In October, BigLake, Google Cloud's data lake storage engine, began support for Apache Iceberg, with Databricks format Delta and Hudi streaming set to come soon.

Speaking to The Register, Sudhir Hasbe, senior director of product management at Google Cloud, said: "If you're doing fine-grained access control, you need to have a real table format, [analytics engine] Spark is not enough for that. We had some discussion around whether we are going with Iceberg, Delta or Hudi, and our prioritization was based on customer feedback. Some of our largest customers were basically deciding in the same realm and they wanted to have something that was really open, driven by the community and so on. Snap [social media company] is one of our early customers, all their analytics is [on Google Cloud] and they wanted to push us towards Iceberg over other formats."

He said Iceberg was becoming the "primary format," although Google is committed to supporting Hudi and Delta in the future. He noted Cloudera and Snowflake were now supporting Iceberg while Google has a partnership with Salesforce over the Iceberg table format.

Cloudera started in 2008 as a data lake company based on Hadoop, which in its early days was run on distributed commodity systems on-premises, with a gradual shift to cloud hosting coming later.

Today, Cloudera sees itself as a multi-cloud data lake platform, and in July it announced its adoption of the Iceberg open table format.

Chris Royles, Cloudera's Field CTO, told The Register that since it was first developed, Iceberg had seen steady adoption as the contributions grew from a number of different organizations, but vendor interest has begun to ramp up over the last year.

"It has lots of capability, but it's very simple," he said. "It's a client library: you can integrate it with any number of client applications, and they can become capable of managing Iceberg table format. It enables us to think in terms of how different clients both within the Cloudera ecosystem, and outside it – the likes of Google or Snowflake – could interact with the same data. Your data is open. It's in a standard format. You can determine how to manage, secure and own it. You can also bring whichever tools you choose to bear on that data."

The result is a reduction in the cost of moving data, and improved throughput and performance, Royles said. "The sheer volume of data you can manage, the number of data objects you can manage, and the complexity of the partitioning: it's a multiplication factor. You're talking five or 10 times more capable by using Iceberg as a table format."

Snowflake kicked off as a data warehouse, wowing investors with its so-called cloud-native approach to separating storage and compute, allowing a more elastic method than on-prem-based data warehousing. Since its 2020 IPO – which briefly saw it hit a value of $120 billion – the company has diversified as a cloud-based data platform, supporting unstructured data, machine learning language Python, transactional data and most recently Apache Iceberg.

James Malone, Snowflake senior product manager, told El Reg that cloud blob storage such as that offered by AWS, Google and Azure is durable and inexpensive, but could present challenges when it comes to performance analytics.

"The canonical example is if you have 1,000 Apache Parquet files, if you have an engine that's operating on those files, you have to go tell it if these are 1,000 tables with one Parquet file apiece or if it is two tables with 500 Parquet files … it doesn't know," he said. "The problem is even more complex when you have multiple engines operating on the same set of data and then you want things like ACID-compliance and like safe data types. It becomes a huge, complicated mess. As cheap durable cloud storage has proliferated it has also put downward pressure on the problem of figuring out how to do high-performance analytics on top of that. People like the durability and the cost-effectiveness of storage, but there's also a set of expectations and a set of desires in terms of how engines can work and how you can derive value from that data."
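Malone's example can be made concrete with a toy Python sketch. It does not reproduce Iceberg's actual metadata layout; the object keys, the table names and the `TableManifest` class are all hypothetical, and the point is only that a flat bucket listing cannot say how files group into tables, whereas table-level metadata can.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableManifest:
    """Hypothetical stand-in for table metadata: which files belong to a table."""
    name: str
    data_files: List[str] = field(default_factory=list)


# A bucket listing is a flat list of object keys; nothing in it says
# how the files group into tables.
bucket = [f"data/part-{i:04d}.parquet" for i in range(1000)]

# Two equally plausible readings of the same 1,000 files.
one_file_per_table = [TableManifest(f"t{i}", [key]) for i, key in enumerate(bucket)]
two_tables = [
    TableManifest("orders", bucket[:500]),
    TableManifest("customers", bucket[500:]),
]


def files_for(tables: List[TableManifest], table_name: str) -> List[str]:
    """Resolve a table name to its files from metadata, not from the listing."""
    for table in tables:
        if table.name == table_name:
            return table.data_files
    raise KeyError(table_name)
```

With the second reading recorded as metadata, an engine asked to query `orders` knows to read exactly 500 files, and any other engine consulting the same metadata reaches the same answer.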

Snowflake supports the idea that Iceberg is agnostic both in terms of the file format and analytics engine. For a cloud-based data platform with a steadily expanding user base, this represents a significant shift in how customers will interact with and, crucially, pay for Snowflake.

The first and smallest move is the idea of external tables. When files are imported into an external table, metadata about the files is saved and a schema is applied on read when a query is run on a table. "That allows you to project a table on top of a set of data that's managed by some other system, so maybe I do have a Hadoop cluster that has a meta store, and that system owns the security, it owns the updates, it owns the transactional safety," Malone said. "External tables are really good for situations like that, because it allows you to not only query the data in Snowflake, but you can also use our data sharing and governance tools."

But the bigger move from Snowflake, currently only available in preview, is its plan to build a brand-new table type inside of Snowflake. It is set to have parity in terms of features and performance with a standard Snowflake table, but uses Parquet as the data format and Iceberg as the metadata format. Crucially, it allows customers to bring their own storage to Snowflake instead of having Snowflake manage it for them, and storage can be a significant cost in an analytics setup. "Traditionally with the standard Snowflake table, Snowflake provides the cloud storage. With an Iceberg table, it's the customer that provides the cloud storage and that's a huge shift," Malone said.

The move promises to give customers the option of taking advantage of volume discounts negotiated with blob storage providers across all their storage, or of negotiating new deals based on demand, while paying Snowflake only for the technology it provides in terms of analytics, governance, security and so on.

"The reality is, customers have a lot of data storage and telling people to go move and load data into your system creates friction for them to actually go use your product and is not generally a value add for the customer," Malone said. "So we've built Iceberg tables in a way where our platform benefits work, without customers having to go through the process of loading data into Snowflake. It meets the customer where they are and still provides all of the benefits."

But Iceberg does not only affect the data warehouse market; it also has an impact on data lakes and the emerging lakehouse category, which claims to be a useful combination of the data warehouse and lake concepts. Founded in 2015, Dremio places itself in the lakehouse category also espoused by Databricks and tiny Californian startup Onehouse.

Dremio was the first tech vendor

Source: The Register
