The team behind in-process OLAP database DuckDB has put forward a solution to the "small changes" problem that they say plagues lakehouse implementations of the kind based on technologies from Databricks, Snowflake, Google, and others.
The consulting and support company behind the open source RDBMS has just released the first production-ready iteration of its DuckLake lakehouse format following a manifesto launch last year. The May 2025 DuckLake manifesto promised to re-engineer the concept of combining data warehouses and data lakes on a single system.
Essentially, it proposed using an RDBMS to manage the metadata in lakehouse implementations based on the common open table formats Apache Iceberg and Delta Lake (introduced by Databricks, governed by the Linux Foundation), showing engineers how they could use PostgreSQL, SQLite or DuckDB as the catalog database for the task.
With its DuckLake v1.0, a production-ready lakehouse format specification launched this week, the DB gurus are showing how the database can be used to solve the so-called "small changes" problem common to lakehouse systems based on open table formats, which rely on the Parquet file format.
Hannes Mühleisen, a DuckDB Labs co-founder and CEO, told The Register: "You make a small change to your table, adding a single row, and it affects data lake performance because, due to the way they work, a new file has to be written that ... contains one row, and then a bunch of metadata has to be written ... and then the catalog has to make an update. This is very inefficient, because formats like Parquet really don't want to store a single row, they want to store a million rows, and retrieving all these tiny files from object stores is extremely inefficient because you do all these transfers."
The DuckLake approach uses the metadata RDBMS to batch up those small changes and then transfers them to Parquet in relatively bigger chunks, said Mühleisen, who is also a professor at Amsterdam's Centrum Wiskunde & Informatica mathematical and theoretical computing research center.
"The key design difference between other data lake formats and DuckLake is that we have a database and we're not afraid of using it. We have all this metadata about the data lake in a catalog in a DuckLake database where we know which tables exist; which files exist; how they all belong together; what changes have happened over time – all that stuff. Now, you're adding a single row and instead of writing a new file to the object store, we're going to add that to a table in the database. The key insight here is that database systems like PostgreSQL, but also DuckDB and others, are much, much better at handling small changes than object stores," he said.
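Mühleisen's description of the catalog — which tables exist, which files exist, how they belong together, what changed over time — maps naturally onto a few relational tables. The schema below is a deliberately simplified, hypothetical sketch of that idea (using Python's built-in SQLite for portability), not DuckLake's actual metadata layout; all table and column names are invented for illustration:

```python
import sqlite3

# Hypothetical, minimal catalog schema -- illustrative only,
# not DuckLake's real metadata tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lake_table (        -- which tables exist
    table_id   INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE data_file (         -- which Parquet files exist, and for which table
    file_id    INTEGER PRIMARY KEY,
    table_id   INTEGER REFERENCES lake_table(table_id),
    path       TEXT NOT NULL,
    row_count  INTEGER
);
CREATE TABLE snapshot (          -- what has changed over time
    snapshot_id INTEGER PRIMARY KEY,
    table_id    INTEGER REFERENCES lake_table(table_id),
    change_kind TEXT,            -- e.g. 'append', 'delete'
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# Registering a table and one of its Parquet files is just two row inserts
# in the catalog database -- cheap, transactional operations.
con.execute("INSERT INTO lake_table (name) VALUES ('events')")
con.execute(
    "INSERT INTO data_file (table_id, path, row_count) VALUES (1, ?, 1000000)",
    ("s3://bucket/events/0001.parquet",),
)
print(con.execute(
    "SELECT name, path FROM lake_table JOIN data_file USING (table_id)"
).fetchall())
```

Because the catalog is an ordinary transactional database, metadata updates that would otherwise mean rewriting manifest files on an object store become single-row inserts.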
The metadata database stores small changes, such as additions and deletions of rows, until they are eventually "flushed" back to Parquet again as a relatively bigger file, while remaining "completely transparent to the user at the same time," he said.
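The buffer-and-flush behaviour described above can be sketched in a few lines. This is a toy model, not DuckLake's implementation: the threshold, the flush trigger, and the in-memory stand-in for Parquet files on an object store are all invented for illustration:

```python
import sqlite3

FLUSH_THRESHOLD = 3    # invented, deliberately tiny for illustration
flushed_files = []     # stand-in for Parquet files on an object store

# Small changes land in an embedded database table, not the object store.
buf = sqlite3.connect(":memory:")
buf.execute("CREATE TABLE inbox (id INTEGER, payload TEXT)")

def insert_row(row_id, payload):
    """A single-row write: a cheap database insert, no new Parquet file."""
    buf.execute("INSERT INTO inbox VALUES (?, ?)", (row_id, payload))
    (pending,) = buf.execute("SELECT COUNT(*) FROM inbox").fetchone()
    if pending >= FLUSH_THRESHOLD:
        flush()

def flush():
    """Batch the buffered rows into one relatively bigger file."""
    rows = buf.execute("SELECT id, payload FROM inbox ORDER BY id").fetchall()
    flushed_files.append(rows)        # one 'Parquet file' per flush
    buf.execute("DELETE FROM inbox")

for i in range(7):
    insert_row(i, f"row-{i}")

# Seven single-row inserts became two three-row files, one row still buffered.
print(len(flushed_files))  # 2
(pending,) = buf.execute("SELECT COUNT(*) FROM inbox").fetchone()
print(pending)             # 1
```

The point of the sketch is the ratio: the object store sees two larger writes instead of seven tiny files, while every row remains queryable the moment it is inserted.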
In a blog post accompanying the 1.0 launch, Pedro Holanda, DuckDB Labs principal engineer, said the company's benchmarks show queries running 926× faster and ingestion 105× faster compared to the Iceberg open table format.
"When I wrote the blog post about saying we had the 1,000x difference, I felt like, 'Oh, some people are gonna get angry,' but no one got angry. They're like, 'This is a real problem.' I even had someone say they were cheating with the architecture. That's the whole point: cheating with a better design," Holanda told The Register.
Engineers continue to build around existing lakehouse architecture though, and are trying to solve the same problems. On the launch of DuckLake last year, Jake Ye, an AWS veteran and software engineer at AI database company LanceDB, blogged that the industry has been "increasingly consolidated around JSON-based protocols as the foundation for interoperability." At the same time, there were adoption challenges around DuckLake without good structured extensibility, versioning and transport-layer separation, he said.
Russell Spitzer, principal engineer with Snowflake, at the time told us many projects were "pretty far along the road with Iceberg, and the Iceberg community is already addressing the metadata catalog problems. DuckDB is still a fledgling database while incumbents are already well entrenched in the market. We might have to wait a while before learning whether the DuckLake concept will fly." ®
Source: The Register