This tutorial series guides you from the fundamentals of Apache Spark to advanced concepts, with hands-on examples to help you become proficient in big data processing. Each part builds on the previous one, using Scala in AWS Glue (or a local Spark setup).
## Part 1: Introduction to Apache Spark
### What is Apache Spark?
- Overview of Spark’s architecture
- Differences between Spark and Hadoop MapReduce
- Use cases in real-world applications
### Setting Up Spark
- Installing Spark locally
- Running Spark in Jupyter notebooks (via a Scala kernel such as Almond or Apache Toree)
- Introduction to the Spark shell (`spark-shell`) and `spark-submit` (see the sketch below)
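Once Spark is installed, a quick sanity check is to paste a few lines into `spark-shell`, which starts with a ready-made `SparkSession` bound to `spark`. A minimal sketch (the numbers are arbitrary):

```scala
// Inside spark-shell, `spark` (a SparkSession) is already in scope.
val nums = spark.range(1, 101)                 // Dataset[Long] with ids 1..100
println(nums.count())                          // 100
println(nums.selectExpr("sum(id)").first().getLong(0)) // 5050
```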
### First Spark Application
- Writing and running a “Hello World” Spark job
- Understanding RDDs, DataFrames, and Datasets
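As a preview of this first exercise, here is a minimal self-contained job (the object name and data are illustrative) that you would package and run with `spark-submit`:

```scala
import org.apache.spark.sql.SparkSession

// A minimal "Hello World" Spark job: count word occurrences in a tiny Dataset.
object HelloSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HelloSpark").getOrCreate()
    import spark.implicits._

    val words  = Seq("hello", "world", "hello", "spark").toDS()
    val counts = words.groupBy("value").count()
    counts.show() // the action that actually triggers execution

    spark.stop()
  }
}
```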
## Part 2: Working with Data in Spark
### Spark Core Concepts
- Resilient Distributed Datasets (RDDs) vs. DataFrames vs. Datasets
- Understanding transformations and actions
- Lazy evaluation and DAG (Directed Acyclic Graph)
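To make the transformation/action distinction concrete, here is a small sketch (assuming a `SparkSession` named `spark`, as in `spark-shell`); the first three Spark calls only build the DAG, and nothing runs until the action at the end:

```scala
val sc = spark.sparkContext

// Transformations are lazy: these lines only describe the computation.
val rdd      = sc.parallelize(1 to 10)
val doubled  = rdd.map(_ * 2)        // transformation
val filtered = doubled.filter(_ > 5) // transformation

// The action below triggers an actual Spark job.
println(filtered.collect().mkString(", ")) // 6, 8, 10, ..., 20
```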
### Data Ingestion in Spark
- Reading data from CSV, JSON, Parquet, and databases (via JDBC)
- Loading data from Amazon S3 and the AWS Glue Data Catalog
- Handling schema inference vs. supplying explicit schemas (see the sketch below)
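A sketch of these ingestion patterns, assuming a `SparkSession` named `spark`; all paths, table names, and connection details are placeholders, and the catalog read assumes the Glue Data Catalog is configured as the metastore:

```scala
// CSV with header and schema inference (convenient, but costs an extra pass).
val csvDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3://my-bucket/raw/events.csv")

// JSON and Parquet readers infer (or carry) their own schemas.
val jsonDF    = spark.read.json("s3://my-bucket/raw/events.json")
val parquetDF = spark.read.parquet("s3://my-bucket/curated/events/")

// With the Glue Data Catalog as the metastore, tables read like Hive tables.
val catalogDF = spark.table("my_database.my_table")

// JDBC read from a relational database.
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")
  .option("dbtable", "public.orders")
  .option("user", "app_user")
  .option("password", "app_password")
  .load()
```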
### Data Transformations
- Using map(), filter(), flatMap()
- Performing groupBy(), reduceByKey(), aggregateByKey()
- Joining datasets efficiently (shuffle joins vs. broadcast/map-side joins; see the sketch below)
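A sketch combining these operations (assuming `spark` in scope; the data is illustrative): an RDD-style word count with `flatMap`/`map`/`reduceByKey`, then a DataFrame broadcast join that ships the small side to every executor to avoid a shuffle:

```scala
import org.apache.spark.sql.functions.broadcast
import spark.implicits._
val sc = spark.sparkContext

// RDD-style word count.
val lines = sc.parallelize(Seq("spark is fast", "spark is fun"))
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println) // (spark,2), (is,2), (fast,1), (fun,1)

// Broadcast join: hint that `users` is small enough to copy to each executor.
val orders = Seq((1, "laptop"), (2, "phone")).toDF("user_id", "item")
val users  = Seq((1, "Ada"), (2, "Grace")).toDF("user_id", "name")
orders.join(broadcast(users), "user_id").show()
```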
### Writing Data Back
- Writing to CSV, Parquet, and JSON
- Partitioning and bucketing for optimized queries
- Writing to Amazon S3, the Glue Data Catalog, and databases (see the sketch below)
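A sketch of the write-side patterns, assuming a DataFrame `ordersDF` with `year`, `month`, and `user_id` columns; the S3 path and table name are placeholders:

```scala
// Partitioned Parquet on S3: one directory per (year, month) value,
// which lets readers prune partitions instead of scanning everything.
ordersDF.write
  .mode("overwrite")
  .partitionBy("year", "month")
  .parquet("s3://my-bucket/curated/orders/")

// Bucketing pre-shuffles data by key but requires a metastore-backed table.
ordersDF.write
  .mode("overwrite")
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .saveAsTable("analytics.orders_bucketed")
```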
## Part 3: Advanced Transformations and Optimization
### Spark SQL and the DataFrame API
- Creating and querying DataFrames
- Registering temporary views and using SQL queries
- Performance comparison: RDDs vs. DataFrames vs. Datasets
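To show the DataFrame/SQL round trip, a minimal sketch (assuming `spark` in scope; the data is illustrative):

```scala
import spark.implicits._

// Register a DataFrame as a temporary view and query it with SQL.
val salesDF = Seq(("2024-01", 100.0), ("2024-02", 250.0)).toDF("month", "revenue")
salesDF.createOrReplaceTempView("sales")

spark.sql("""
  SELECT month, SUM(revenue) AS total
  FROM sales
  GROUP BY month
  ORDER BY month
""").show()
```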
### Optimization Techniques
- Catalyst Optimizer & Tungsten Execution Engine
- Understanding lazy execution and Spark jobs
- Repartitioning vs. coalescing for performance
- Using persist() and cache() effectively
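A sketch of the partitioning and caching calls, assuming an existing DataFrame `df` with a `status` column (names are placeholders):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// repartition(n) does a full shuffle and can increase parallelism;
// coalesce(n) merges existing partitions without a shuffle, so it is
// the cheaper way to *reduce* partition count (e.g., before a write).
val wide   = df.repartition(200)
val narrow = df.coalesce(10)

// cache() is shorthand for persist() with the default storage level;
// persist() lets you choose the level explicitly.
val reused = df.filter(col("status") === "active")
  .persist(StorageLevel.MEMORY_AND_DISK)
reused.count()     // the first action materializes the cache
reused.unpersist() // release it when no longer needed
```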
### Working with UDFs (User-Defined Functions)
- Writing custom Scala UDFs
- Vectorized Pandas UDFs in PySpark as a performance comparison (Spark 3.0+)
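A minimal Scala UDF sketch (assuming `spark` in scope; data is illustrative). One caveat worth teaching alongside it: UDFs are opaque to the Catalyst optimizer, so prefer built-in functions when one exists:

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// A null-safe UDF that trims and lowercases a string column.
val normalize = udf((s: String) => Option(s).map(_.trim.toLowerCase).orNull)

val df = Seq("  Spark ", "SCALA").toDF("raw")
df.select(normalize($"raw").as("clean")).show() // "spark", "scala"
```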
## Part 4: Spark Streaming & Real-Time Data Processing
### Introduction to Structured Streaming
- Difference between batch processing and stream processing
- Understanding the micro-batch model
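The built-in `rate` source makes the micro-batch model easy to observe: rows are generated continuously but processed in discrete batches on each trigger. A minimal sketch (assuming `spark` in scope):

```scala
import org.apache.spark.sql.streaming.Trigger

// The rate source emits `rowsPerSecond` rows with a timestamp and a value.
val rate = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// With a processing-time trigger, Spark collects input into a micro-batch
// every 10 seconds and runs it as a small batch job.
val query = rate.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
```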
### Building a Real-Time Streaming Pipeline
- Reading from Apache Kafka and Amazon Kinesis
- Writing to S3, databases, and NoSQL stores (DynamoDB, MongoDB); see the sketch below
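A Kafka-to-S3 sketch of the pipeline shape (broker, topic, and paths are placeholders; it assumes the `spark-sql-kafka` connector is on the classpath and `spark` is in scope):

```scala
// Read a Kafka topic as a stream.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Kafka delivers binary key/value columns; cast the value to a string.
val events = stream.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

// Sink micro-batches to S3 as Parquet; the checkpoint tracks progress
// so the query can restart exactly where it left off.
val query = events.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/streaming/events/")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")
  .start()

query.awaitTermination()
```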
### Stateful Processing
- Watermarking, windowed aggregations, and handling late-arriving data (see the sketch below)
- Using checkpointing and stateful operations
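Continuing from the Kafka sketch above (reusing its `events` DataFrame), a windowed count with a watermark; the watermark bounds how much state Spark keeps by dropping events that arrive more than 10 minutes late (paths are placeholders):

```scala
import org.apache.spark.sql.functions.{col, window}

// Count events per 5-minute window, tolerating up to 10 minutes of lateness.
val windowed = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

val query = windowed.writeStream
  .outputMode("append") // a window is emitted once the watermark passes it
  .format("parquet")
  .option("path", "s3://my-bucket/streaming/windowed/")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/windowed/")
  .start()
```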
By the end of this series, you will have a strong understanding of Apache Spark and its real-world applications, from basic data transformations to advanced optimizations and streaming pipelines.