If you’ve been following the data engineering world, you’ve probably heard about Apache Iceberg. It’s become one of the most talked-about technologies in the modern data stack, and for good reason.
What is Apache Iceberg?
Apache Iceberg is an open table format designed for huge analytic datasets. Think of it as a layer that sits between your compute engines (like Spark, Trino, or Flink) and your storage layer (like S3, HDFS, or cloud object storage).
The key insight behind Iceberg is that traditional data lake table formats, most notably the Hive format, have significant limitations when it comes to reliability, performance, and usability. Iceberg addresses these problems with a modern table format that brings database-like features to data lakes.
Why is it useful?
1. ACID transactions
One of the biggest pain points with traditional data lakes is the lack of transactional guarantees. If a write operation fails halfway through, you can end up with corrupted or inconsistent data. Iceberg provides full ACID transaction support, meaning your writes either fully succeed or fully fail—no partial states.
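As a sketch of what this looks like in practice, the statement below upserts staged rows in a single atomic commit: concurrent readers see the table either entirely before the merge or entirely after it. (This assumes a hypothetical staging table called updates with the same schema; run it via spark.sql() or any Iceberg-aware engine.)

```sql
-- Upsert staged rows into an Iceberg table as one atomic commit.
-- 'updates' is a hypothetical staging table with matching columns.
MERGE INTO my_catalog.my_db.my_table AS t
USING updates AS u
  ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.data = u.data
WHEN NOT MATCHED THEN INSERT *;
```

If the commit fails for any reason, the table simply stays at its previous snapshot; there is no half-merged state to clean up.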
2. Schema evolution
With Iceberg, you can safely add, remove, rename, or reorder columns without rewriting your entire dataset. This might sound simple, but it’s a huge deal when you’re working with petabyte-scale tables. Traditional formats often require expensive and time-consuming data migrations for schema changes.
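For illustration, here is roughly what that looks like in Spark SQL (the region column name is made up for the example). Each statement is a metadata-only change; no data files are rewritten:

```sql
-- All three are metadata-only operations in Iceberg.
ALTER TABLE my_catalog.my_db.my_table ADD COLUMN region string;
ALTER TABLE my_catalog.my_db.my_table RENAME COLUMN data TO payload;
ALTER TABLE my_catalog.my_db.my_table DROP COLUMN region;
```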
3. Hidden partitioning
Iceberg handles partitioning automatically based on column values you specify. Users don’t need to know how the data is partitioned to write efficient queries—Iceberg optimizes scans automatically. This eliminates a common source of bugs where queries accidentally scan all partitions.
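A quick sketch of the idea, using a hypothetical events table: the table is partitioned by a transform of a timestamp column, and queries filter on the timestamp itself without ever naming a partition.

```sql
-- Partition by the day derived from event_ts using Iceberg's
-- days() partition transform.
CREATE TABLE my_catalog.my_db.events (
    id       bigint,
    event_ts timestamp,
    payload  string
)
USING iceberg
PARTITIONED BY (days(event_ts));

-- No partition column appears in the query, yet Iceberg prunes
-- the scan down to a single day's files.
SELECT count(*)
FROM my_catalog.my_db.events
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00'
  AND event_ts <  TIMESTAMP '2024-06-02 00:00:00';
```

Compare this with a Hive-style table, where forgetting to filter on the separate partition column (say, event_date) would silently trigger a full-table scan.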
4. Time travel
Every change to an Iceberg table creates a new snapshot. You can query your data as it existed at any point in time, which is invaluable for debugging, auditing, and reproducing historical analyses.
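In Spark SQL, time travel looks like this (the timestamp and snapshot ID below are placeholders; real snapshot IDs come from the table's metadata):

```sql
-- Query the table as it existed at a point in time, or at a
-- specific snapshot.
SELECT * FROM my_catalog.my_db.my_table TIMESTAMP AS OF '2024-06-01 00:00:00';
SELECT * FROM my_catalog.my_db.my_table VERSION AS OF 4348572093485;

-- Discover available snapshots via the built-in metadata table.
SELECT snapshot_id, committed_at
FROM my_catalog.my_db.my_table.snapshots;
```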
5. Engine agnostic
Unlike formats tied to specific engines, Iceberg works with Spark, Trino, Flink, Dremio, Snowflake, and many others. This means you can choose (or switch) your compute engine without changing your data format.
When should you use it?
Iceberg shines when you have:
- Large-scale analytical workloads (terabytes to petabytes)
- Multiple teams or applications reading/writing the same data
- Requirements for data versioning or point-in-time queries
- Complex data pipelines that need transactional guarantees
- A need for engine flexibility or multi-engine architectures
If you’re building a modern data lakehouse, Iceberg (along with Delta Lake and Apache Hudi) represents the current state of the art.
Getting started
The easiest way to try Iceberg is with Apache Spark. Once an Iceberg catalog is configured in your Spark session, you can create a table with a few lines of SQL:
```python
spark.sql("""
    CREATE TABLE my_catalog.my_db.my_table (
        id bigint,
        data string,
        category string
    )
    USING iceberg
    PARTITIONED BY (category)
""")
```
From there, you can use standard SQL for inserts, updates, and deletes—something that was surprisingly difficult with older data lake formats.
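For example, row-level operations like these just work against the table created above (the sample values are arbitrary):

```sql
INSERT INTO my_catalog.my_db.my_table VALUES (1, 'hello', 'greetings');
UPDATE my_catalog.my_db.my_table SET data = 'hi' WHERE id = 1;
DELETE FROM my_catalog.my_db.my_table WHERE category = 'greetings';
```

Each statement commits as its own snapshot, so every one of them is also a point you can time-travel back to later.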
The bottom line
Apache Iceberg represents a significant step forward for data lakes. It brings the reliability and usability we’ve come to expect from traditional databases while maintaining the scalability and cost benefits of cloud object storage.
If you’re building data infrastructure today, Iceberg is worth serious consideration. The ecosystem is mature, adoption is accelerating, and the benefits are substantial.