Skip to main content

Episode 14: What Is Apache Iceberg?

Lars Kamp

Apache Iceberg is a new table format that offers both the simplicity of SQL and separation of storage and compute. The Iceberg table format works with any compute engine, so users are not limited to working with a single engine. Popular engines (e.g., Spark, Trino, Flink, and Hive) and modern cloud warehouses (e.g., Snowflake, Redshift, and BigQuery) can work with Iceberg tables at the same time.

A table format is a layer that sits between the file format and database. Iceberg is an abstraction layer above file formats like Parquet, Avro, and ORC born out of necessity at Netflix. Like many other companies at the time, Netflix shifted from MPP data warehouses to the Hadoop ecosystem in the 2010s. MPP warehouses like Teradata were hitting scale limitations and becoming too expensive at Netflix's scale.

The Hadoop ecosystem abandoned the table abstraction layer in favor of scale. In Hadoop, we deal directly with file systems like HDFS. The conventional wisdom at the time was that bringing compute to storage was easier than moving the data to compute. Hadoop scales compute and disk together, which turned out to be incredibly hard to manage in the on-premise world.

Early on, Netflix shifted to the cloud and started storing data in Amazon S3 instead, which separated storage from compute. Snowflake, the cloud warehouse, also picked up on that principle, bringing back SQL semantics and tables from "old" data warehouses.

Netflix wanted to separate storage/compute and SQL table semantics. They wanted to add, remove, and rename columns without S3 paths. But rather than going with another proprietary vendor, Netflix wanted to stay in open source and open formats. And thus, Iceberg was developed and eventually donated to the Apache Foundation. Today, Iceberg is also in use at companies like Apple and LinkedIn.

Tabular commercializes Apache Iceberg. Working with open-source Iceberg tables still requires understanding of object stores and distributed data processing engines and how various components interact with each other. Tabular lowers the bar for adoption and removes the heavy lifting.

Jason Reid is a co-founder and heads Product at Tabular. In this episode, Jason walks us through the benefits of using an open table format like Iceberg and how it works with existing analytics infrastructure and tooling of the modern data stack like dbt.