[02] – Getting Started with Apache Iceberg

In this tutorial, you’ll learn how to install Apache Iceberg locally and get hands-on experience using Apache Spark as a compute engine.

First, we’ll set up the environment by creating a Python virtual environment and installing PySpark. Please note that some prior knowledge of PySpark is required to follow this tutorial.

Next, you’ll learn how to read and write PySpark DataFrames using Apache Iceberg.

Finally, you’ll explore Iceberg’s key features, including schema evolution, ACID transactions, partition evolution, and time travel.

Prerequisites #

Before you start a PySpark session with Apache Iceberg, you will need to have the following installed:

  • Java (v8 or v11)
  • PySpark
  • Apache Spark (the version depends on the Iceberg version)

Install required dependencies #

Before working with Apache Iceberg and PySpark, you’ll need to install the required dependencies. Use the following commands to create and activate a virtual environment (named iceberg, or any name you prefer), then install the pyspark package.

python -m venv iceberg
source iceberg/bin/activate
pip install pyspark==3.4.1

If you are using Windows, run the command .\iceberg\Scripts\activate to activate the virtual environment.

The versions used in this article are:

  • Python: 3.11.6
  • PySpark: 3.4.1
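
If you want to confirm the versions in your own environment, a quick check (run from inside the activated virtual environment) looks like this:

🐍
# Print the Python and PySpark versions in the active environment
import sys
import pyspark

print(sys.version)
print(pyspark.__version__)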

Import required packages #

Once you install the dependencies, import the necessary packages that you will use in the following sections.

🐍
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

Setup Iceberg Configuration #

To start working with Iceberg tables in PySpark, you first need to configure the PySpark session correctly. In this setup, you’ll use a catalog named demo that points to a local warehouse directory at ./warehouse, using the Hadoop catalog type.

The warehouse path acts as the storage location for all Iceberg table data and metadata. It’s essential for managing table state and ensuring consistent access across read/write operations.

Make sure the Iceberg-Spark Runtime JAR you use is compatible with your PySpark version. You can find the appropriate JAR files in the official Iceberg releases. For more advanced settings, refer to the Iceberg-Spark-Configuration documentation.

🐍
warehouse_path = "./warehouse"
iceberg_spark_jar  = 'org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.0'
catalog_name = "demo"

# Setup iceberg config
conf = SparkConf().setAppName("YourAppName") \
    .set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .set(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .set('spark.jars.packages', iceberg_spark_jar) \
    .set(f"spark.sql.catalog.{catalog_name}.warehouse", warehouse_path) \
    .set(f"spark.sql.catalog.{catalog_name}.type", "hadoop")\
    .set("spark.sql.defaultCatalog", catalog_name) 

# Create spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

With the Spark session now correctly configured, let’s move on to creating and reading an Iceberg table.

🐍
# Create a dataframe
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('job_title', StringType(), True)
])
data = [("person1", 28, "Doctor"), ("person2", 35, "Singer"), ("person3", 42, "Teacher")]
df = spark.createDataFrame(data, schema=schema)

# Create database
spark.sql(f"CREATE DATABASE IF NOT EXISTS db")

# Write and read Iceberg table
table_name = "db.persons"
df.write\
  .format("iceberg")\
  .mode("overwrite")\
  .saveAsTable(f"{table_name}")

iceberg_df = spark.read\
  .format("iceberg")\
  .load(f"{table_name}")

iceberg_df.printSchema()
iceberg_df.show()

In the code above, we created a PySpark DataFrame, wrote it to an Iceberg table db.persons, and then read the data back for display.
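
Because the table is registered in the demo catalog, which we set as the default catalog, you can also query it with plain Spark SQL; the query below is equivalent to the DataFrame read above:

🐍
# Query the same Iceberg table through Spark SQL
spark.sql(f"SELECT * FROM {table_name}").show()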

Now, let’s explore the key features of Iceberg that help solve the challenges outlined in the introduction.

Schema Evolution #

Data lakes support various data formats, but that flexibility often makes schema management a challenge. Iceberg addresses this by allowing you to add, remove, or modify columns without rewriting the entire dataset — making schema evolution seamless and efficient.

Let’s update the existing table db.persons to demonstrate how schema evolution works in practice.

🐍
spark.sql(f"ALTER TABLE {table_name} RENAME COLUMN job_title TO job")
spark.sql(f"ALTER TABLE {table_name} ALTER COLUMN age TYPE bigint")
spark.sql(f"ALTER TABLE {table_name} ADD COLUMN salary FLOAT AFTER job")
iceberg_df = spark.read.format("iceberg").load(f"{table_name}")
iceberg_df.printSchema()
iceberg_df.show()

spark.sql(f"SELECT * FROM {table_name}.snapshots").show()

The code above demonstrates schema evolution with three operations:

  • Renaming a column
  • Changing a column’s data type
  • Adding a new column

As shown in the screenshots below, job_title has been renamed to job, the age column’s type has been updated, and a new column, salary, has been added.

Figure: Schema before altering the table
Figure: Schema after altering the table

The first time you run the code, the snapshots table shows that Iceberg applied all of these alterations without rewriting any data: there is only a single snapshot ID, with no parent (parent_id = null).

Figure: Snapshot table
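
You can confirm this directly from the metadata by projecting the lineage columns of the snapshots table (a quick check; the column names follow Iceberg’s snapshots metadata table):

🐍
# Only one snapshot should exist at this point, and its parent_id should be null
spark.sql(f"SELECT snapshot_id, parent_id, operation FROM {table_name}.snapshots").show(truncate=False)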

ACID transactions #

Data accuracy and consistency are crucial in data lakes, particularly for business-critical purposes. Iceberg supports ACID transactions for write operations, ensuring that data remains in a consistent state, and enhancing the reliability of the stored information.

To demonstrate ACID transactions on the Iceberg table, let’s update, insert, and delete some records.

🐍
spark.sql(f"UPDATE {table_name} SET salary = 100")
spark.sql(f"DELETE FROM {table_name} WHERE age = 42")
spark.sql(f"INSERT INTO {table_name} values ('person4', 50, 'Teacher', 2000)")
spark.sql(f"SELECT * FROM {table_name}.snapshots").show()

In the snapshots table, we can now see that Iceberg has added three new snapshots, each created from the preceding one. If any of these operations fails, the whole transaction fails and no snapshot is created.

Figure: ACID transactions
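
If you only care about the lineage, Iceberg also exposes a history metadata table that records when each snapshot became current (a small sketch using Iceberg’s metadata table convention):

🐍
# One row per snapshot, in the order each became the current table state
spark.sql(f"SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor FROM {table_name}.history").show(truncate=False)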

Partition Evolution #

As you may be aware, querying large amounts of data in data lakes can be resource-intensive. Iceberg supports data partitioning by one or more columns. This significantly improves query performance by reducing the volume of data read during queries.

🐍
spark.sql(f"ALTER TABLE {table_name} ADD PARTITION FIELD age")
spark.read.format("iceberg").load(f"{table_name}").where("age = 28").show()

The code creates a new partition based on the age column. This partitioning applies only to new rows inserted after the change; existing data remains unaffected.

Figure: Partitioned DataFrame
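
To check which data files actually carry the new partition values, you can inspect the files metadata table; files written before the partition change will show no age value in their partition column (a quick check using Iceberg’s files metadata table):

🐍
# Each row is a data file; the partition column shows the values it was written with
spark.sql(f"SELECT file_path, partition, record_count FROM {table_name}.files").show(truncate=False)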

You can also define partitions at table creation time using the PARTITIONED BY clause, as shown below.

🐍
spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {table_name}
        (name STRING, age INT, job STRING, salary INT)
        USING iceberg
        PARTITIONED BY (age)
    """)

Time Travel #

Analyzing historical trends or tracking changes over time is a common need in data lakes. Iceberg addresses this with its time travel feature, allowing users to query data as it existed at a specific snapshot or timestamp.

This capability lets you:

  • Load any historical version of a table
  • Examine how data changed over time
  • Roll back to a previous state if needed

Iceberg’s time-travel API makes it easy to perform audits, recover from errors, or support reproducible analysis based on past data states.

🐍
# List snapshots
spark.sql(f"SELECT * FROM {table_name}.snapshots").show(1, truncate=False)

# Read a specific snapshot by id
spark.read.option("snapshot-id", "306576903892976364").table(table_name).show()

# Read the table as of a point in time; pass an epoch timestamp in milliseconds,
# not a snapshot id (example value, replace with one relevant to your snapshots)
spark.read.option("as-of-timestamp", "1692864000000").table(table_name).show()

The code above demonstrates Iceberg’s time travel feature. The first query lists the table’s available snapshots from its snapshot metadata. The second reads the table as it existed at a specific snapshot ID. The third reads the table as it was at a specific point in time, using a millisecond epoch timestamp.
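
Rolling back to a previous state, mentioned above, is done through one of Iceberg’s Spark stored procedures. A minimal sketch, assuming the demo catalog configured earlier and a snapshot ID taken from your own snapshots table:

🐍
# Roll the table back to an earlier snapshot (replace the ID with one from your snapshots table)
spark.sql(f"CALL {catalog_name}.system.rollback_to_snapshot('{table_name}', 306576903892976364)")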

Updated on August 24, 2025
