
[00] – Apache Spark By Example Course

kerrache.massipssa

This tutorial series will guide you from the fundamentals of Apache Spark to advanced concepts, with hands-on examples to help you become proficient in big data processing. Each section builds on the previous one, using Scala in AWS Glue (or a local Spark setup).

Part 1: Introduction to Apache Spark

What is Apache Spark?

  • Overview of Spark’s architecture
  • Differences between Spark and Hadoop MapReduce
  • Use cases in real-world applications

Setting Up Spark

  • Installing Spark locally
  • Running Spark on Jupyter
  • Introduction to Spark Shell and Spark Submit

First Spark Application

  • Writing and running a “Hello World” Spark job
  • Understanding RDDs, DataFrames, and Datasets
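As a sketch of such a "Hello World" job (assuming Spark is on the classpath and running in local mode), a minimal word count over an in-memory dataset looks like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

// Build a local SparkSession; local[*] uses all available cores
val spark = SparkSession.builder()
  .appName("HelloSpark")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// A tiny in-memory DataFrame standing in for a real input file
val lines = Seq("hello spark", "hello world").toDF("line")

// Split each line into words, then count occurrences per word
val counts = lines
  .select(explode(split($"line", " ")).as("word"))
  .groupBy("word")
  .count()

counts.show()
```

The same logic can be expressed with RDDs (`flatMap` + `reduceByKey`); the DataFrame version shown here lets Spark optimize the plan for you.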

Part 2: Working with Data in Spark

Spark Core Concepts

  • Resilient Distributed Datasets (RDDs) vs. DataFrames vs. Datasets
  • Understanding transformations and actions
  • Lazy evaluation and DAG (Directed Acyclic Graph)
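A small illustration of lazy evaluation (a sketch, assuming a local Spark setup): transformations only record the lineage, and nothing executes until an action runs.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LazyDemo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 10)

// Transformations are lazy: Spark only records the DAG here
val evens = rdd.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// An action (collect) triggers execution of the whole lineage
val result = doubled.collect()  // Array(4, 8, 12, 16, 20)
```

This is why a chain of `filter`/`map` calls returns instantly: the cost is paid only when an action such as `collect`, `count`, or a write is invoked.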

Data Ingestion in Spark

  • Reading data from CSV, JSON, Parquet, and databases
  • Loading data from AWS S3 (Glue Catalog)
  • Handling schema inference
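A minimal sketch of CSV ingestion with schema inference (the temp file stands in for a real path on disk or S3):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Write a tiny CSV to a temp file so the example is self-contained
val csv = "id,name,price\n1,apple,0.5\n2,banana,0.25\n"
val path = Files.createTempFile("fruits", ".csv")
Files.write(path, csv.getBytes)

val spark = SparkSession.builder()
  .appName("Ingest")
  .master("local[*]")
  .getOrCreate()

// header reads column names; inferSchema samples the data to guess types
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path.toString)

df.printSchema()  // price is inferred as a double
```

Note that `inferSchema` triggers an extra pass over the data; for large or production inputs, supplying an explicit schema is both faster and safer.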

Data Transformations

  • Using map(), filter(), flatMap()
  • Performing groupBy(), reduceByKey(), aggregateByKey()
  • Joining datasets efficiently (join, broadcast join, map-side join)
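A sketch of an aggregation followed by a broadcast join, with hypothetical `orders`/`customers` data (assuming local mode):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, sum}

val spark = SparkSession.builder()
  .appName("Xform")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val orders = Seq((1, "a", 10), (2, "b", 20), (3, "a", 5))
  .toDF("id", "cust", "amount")
val customers = Seq(("a", "Alice"), ("b", "Bob"))
  .toDF("cust", "name")

// Aggregate order amounts per customer
val totals = orders.groupBy("cust").agg(sum("amount").as("total"))

// broadcast() ships the small customers table to every executor,
// avoiding a shuffle of the larger side
val joined = totals.join(broadcast(customers), "cust")
```

Broadcasting only pays off when one side comfortably fits in executor memory; Spark also broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold`.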

Writing Data Back

  • Writing to CSV, Parquet, and JSON
  • Partitioning and bucketing for optimized queries
  • Writing to AWS S3, Glue Catalog, and Databases
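A sketch of a partitioned Parquet write (the temp directory stands in for an S3 or warehouse path):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("WriteDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("2025-01-01", "a", 1), ("2025-01-02", "b", 2))
  .toDF("dt", "key", "value")

// partitionBy lays the output out as one subdirectory per dt value,
// so queries filtering on dt can skip whole partitions
val out = Files.createTempDirectory("parquet_out").resolve("events").toString
df.write.partitionBy("dt").parquet(out)

// Reading the directory back recovers the dt column from the paths
val back = spark.read.parquet(out)
```

The same `partitionBy` call works when the target is `s3://...` and the table is registered in the Glue Catalog.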

Part 3: Advanced Transformations and Optimization

Spark SQL and DataFrames API

  • Creating and querying DataFrames
  • Registering temporary views and using SQL queries
  • Performance comparison: RDDs vs. DataFrames vs. Datasets
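Registering a temporary view and querying it with SQL, as a minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Expose the DataFrame to the SQL engine under a table name
df.createOrReplaceTempView("items")

// SQL and the DataFrame API compile to the same optimized plan
val result = spark.sql("SELECT label FROM items WHERE id = 2")
```

Because both paths go through the Catalyst optimizer, choosing SQL versus the DataFrame API is mostly a readability decision, not a performance one.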

Optimization Techniques

  • Catalyst Optimizer & Tungsten Execution Engine
  • Understanding lazy execution and Spark jobs
  • Repartitioning vs. Coalescing for performance
  • Using persist() and cache() effectively
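The repartition/coalesce distinction and caching can be sketched as follows (assuming local mode):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OptDemo")
  .master("local[*]")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 100, 8)

// coalesce is a narrow transformation: it merges partitions without a shuffle,
// good for reducing partition count before a write
val fewer = rdd.coalesce(2)

// repartition triggers a full shuffle to rebalance data evenly,
// required when *increasing* the partition count
val more = rdd.repartition(16)

// cache() marks a dataset for reuse; the first action materializes it
val cached = spark.range(1000).cache()
cached.count()  // materializes the cache for later actions
```

A common pitfall is calling `cache()` on a dataset used only once: the caching cost is then pure overhead.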

Working with UDFs (User-Defined Functions)

  • Writing custom Scala UDFs
  • Using Pandas (vectorized) UDFs for performance in PySpark (Spark 3.0+)
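A minimal Scala UDF sketch; note that UDFs are opaque to the Catalyst optimizer, so a built-in function (here, `upper`) should be preferred when one exists:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .appName("UdfDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Wrap an ordinary Scala function as a column expression
val shout = udf((s: String) => s.toUpperCase + "!")

val df = Seq("spark", "glue").toDF("word")
val result = df.select(shout($"word").as("shouted"))
```

To use a UDF from SQL, register it first with `spark.udf.register("shout", shout)`.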

Part 4: Spark Streaming & Real-Time Data Processing

Introduction to Structured Streaming

  • Difference between batch processing and stream processing
  • Understanding the micro-batch model

Building a Real-Time Streaming Pipeline

  • Reading from Kafka, AWS Kinesis
  • Writing to S3, Databases, NoSQL stores (DynamoDB, MongoDB)

Stateful Processing

  • Handling watermarking, aggregation, and window functions
  • Using checkpointing and stateful operations
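A sketch of a windowed, watermarked streaming aggregation using the built-in `rate` test source (a Kafka or Kinesis source would replace it in a real pipeline; the query is built but not started here):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder()
  .appName("StreamDemo")
  .master("local[*]")
  .getOrCreate()

// The rate source emits (timestamp, value) rows; handy for local testing
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "1")
  .load()

// The watermark bounds how long state is kept for late events;
// window() buckets events into 10-second tumbling windows
val counts = stream
  .withWatermark("timestamp", "30 seconds")
  .groupBy(window(col("timestamp"), "10 seconds"))
  .count()

// Starting the query would look like:
// counts.writeStream
//   .format("console")
//   .outputMode("update")
//   .option("checkpointLocation", "/tmp/ckpt")  // hypothetical path
//   .start()
```

The checkpoint location is what lets a restarted query resume its state and offsets, so it should live on durable storage (e.g., S3) in production.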

By the end of this course, you will have a strong understanding of Apache Spark and its real-world applications, from basic data transformations to advanced optimizations and streaming pipelines.

Updated on March 6, 2025



Copyright © 2025 MasterData
