[01] – What is Apache Spark?

kerrache.massipssa

Apache Spark is a unified computing engine designed for parallel data processing on clusters. It is one of the most actively developed open-source tools for big data, supporting multiple programming languages like Python, Java, Scala, and R. With built-in libraries for SQL, streaming, and machine learning, Spark is highly scalable, running efficiently from a single laptop to massive server clusters, making it a versatile choice for developers and data scientists.

Spark Components #

To understand how Spark works, let’s first explore its core components:

Apache Spark Components
  • Spark Core: The foundation of Spark, providing essential functionalities and APIs.
  • Spark SQL: Enables querying structured data using SQL commands.
  • Spark Streaming: Allows real-time data stream processing.
  • MLlib: A scalable library for machine learning model development.
  • GraphX: Facilitates graph processing and analytics.
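
To make this concrete, here is a minimal PySpark sketch that touches two of these libraries, Spark SQL and MLlib; the application name and sample data are invented for the example, and GraphX is left out because it has no Python API.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer  # MLlib (DataFrame-based API)

spark = SparkSession.builder.appName("components-demo").getOrCreate()

# Spark SQL: register a DataFrame as a view and query it with SQL
df = spark.createDataFrame([("spark sql", 1), ("spark streaming", 2)], ["library", "id"])
df.createOrReplaceTempView("libraries")
spark.sql("SELECT library FROM libraries WHERE id = 1").show()

# MLlib: apply a simple feature transformer
Tokenizer(inputCol="library", outputCol="words").transform(df).show()

spark.stop()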

Architecture #

Now that we have an overview of Spark’s components, let’s dive into its architecture and understand how it operates at scale.

Apache Spark Architecture

Now, let’s explore each component.

Cluster manager #

A cluster manager in Spark is a system that allocates and manages computing resources across multiple machines in a cluster. It decides how to distribute tasks and provides the necessary CPU and memory for Spark applications. Examples include Spark’s standalone cluster manager, YARN, Kubernetes, and Mesos.
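
The cluster manager is selected through the master URL passed when the session (or spark-submit) is configured. The sketch below illustrates the common URL formats; the host names and ports are placeholders, not real endpoints.

from pyspark.sql import SparkSession

# The master URL picks the cluster manager; hosts below are placeholders.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    # .master("spark://master-host:7077")          # Spark standalone
    # .master("yarn")                              # Hadoop YARN
    # .master("k8s://https://k8s-apiserver:6443")  # Kubernetes
    # .master("mesos://mesos-master:5050")         # Apache Mesos
    .master("local[*]")                            # no cluster manager: run locally
    .getOrCreate()
)

print(spark.sparkContext.master)
spark.stop()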

Spark Nodes #

Apache Spark consists of two main types of nodes that work together to execute applications efficiently.

Driver Node #

The Driver is the central component that initiates, manages, and oversees the execution of a Spark application. It:

  • Creates the Spark Session.
  • Distributes tasks to executors across the cluster.
  • Collects and consolidates results.
  • Manages communication with the cluster’s resources.

Worker Nodes #

A worker node is responsible for executing Spark jobs. It hosts executors, which are Java Virtual Machine (JVM) processes that perform the actual computations. Multiple executors can run on a single worker node, enhancing parallelism and efficiency.
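
How many executors an application gets, and how large they are, is usually set through configuration. The values below are purely illustrative; sensible sizes depend on your cluster, and local[*] is only a placeholder master.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .master("local[*]")                       # placeholder; on a real cluster use yarn or k8s://...
    .config("spark.executor.instances", "4")  # number of executors (YARN/Kubernetes)
    .config("spark.executor.cores", "2")      # CPU cores per executor
    .config("spark.executor.memory", "4g")    # heap memory per executor
    .getOrCreate()
)

spark.stop()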

Execution Modes #

Spark can run applications using different execution environments, either with its own scheduler or an external cluster manager for resource management.

  1. Standalone Scheduler: Spark can run using its built-in standalone scheduler, which manages a dedicated Spark cluster without requiring an external resource manager.
  2. Cluster Manager: Spark can also run on an external cluster manager (such as YARN, Mesos, or Kubernetes) that handles resource allocation and workload management across the cluster.

Run mode #

When running a Spark application, two run (deploy) modes are available:

  1. Client Mode: The driver runs on the machine that submits the application, which can be a local machine or the master node of the cluster.
  2. Cluster Mode: The driver runs within the cluster, typically on one of the worker nodes, allowing for better resource management and scalability.
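
In practice the run mode is usually chosen with spark-submit’s --deploy-mode flag; the sketch below only shows the equivalent configuration key, with "yarn" as an example master that assumes a reachable YARN cluster.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("deploy-mode-demo")
    .master("yarn")                               # example cluster manager
    .config("spark.submit.deployMode", "client")  # or "cluster"
    .getOrCreate()
)

spark.stop()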

How Is Data Handled? #

Now that we understand execution, let’s explore how Spark organizes and processes data efficiently.

Data Structures in Spark #

  • Resilient Distributed Datasets (RDDs)

RDDs are the fundamental building blocks of Spark. They are fault-tolerant, parallel data structures that enable efficient distributed data processing across clusters.

  • DataFrame

A DataFrame is a distributed, tabular collection of data organized into named columns. It provides a high-level abstraction for structured data operations, supporting SQL-like syntax for queries.

  • DataSet

Datasets extend the DataFrame API by adding type safety and an object-oriented programming interface. They are available only in Scala and Java and help optimize performance through compile-time type checking.
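
The snippet below is a small sketch of the first two structures in PySpark; Datasets are omitted because they exist only in Scala and Java.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-structures-demo").getOrCreate()

# RDD: a low-level, fault-tolerant collection distributed across the cluster
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])
print(rdd.map(lambda row: row[1]).collect())  # [25, 30]

# DataFrame: the same data with named columns and SQL-like operations
df = spark.createDataFrame(rdd, ["name", "age"])
df.select("name").show()

spark.stop()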

Partitions #

Partitioning in Spark refers to dividing data into smaller, more manageable chunks known as partitions. Partitions are the basic units of parallelism in Spark: they allow the framework to process different portions of the data simultaneously on different nodes in a cluster. A partition is a collection of rows that sits on a single physical machine in your cluster.

In the example below, the DataFrame on the left is divided into three partitions: P1, P2, and P3, with each partition containing two rows. Partition P1 is assigned to Executor 1, while P2 and P3 are assigned to Executor 2.

Partitioned DataFrame

⚠️ SET THE RIGHT PARTITION NUMBER

Too few partitions lead to large partition sizes, potentially causing disk spills and slowing down queries. Too many partitions result in small partition sizes, leading to inefficient I/O patterns and an increased burden on the Spark task scheduler. Setting the right number of partitions is key to the performance of your Spark applications.
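
The sketch below shows how to inspect and change partitioning in PySpark; the partition counts are arbitrary examples, not recommendations.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

df = spark.range(0, 1_000_000)         # a simple DataFrame of ids
print(df.rdd.getNumPartitions())       # current number of partitions

repartitioned = df.repartition(8)      # full shuffle into 8 partitions
coalesced = repartitioned.coalesce(4)  # merge down to 4 without a full shuffle

# Number of partitions produced by wide transformations (defaults to 200)
spark.conf.set("spark.sql.shuffle.partitions", "64")

spark.stop()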

Transformations #

A transformation in Spark is an operation that defines how data should be processed but does not execute immediately. Instead of modifying data directly, transformations create a new DataFrame or RDD with the specified changes. Spark only processes the transformations when an action is triggered.

  • Narrow Transformations

A narrow transformation means each input partition contributes to only one output partition. Since data remains on the same worker node, these transformations are faster and more efficient. Examples: .filter() and .map().

Narrow Transformations
  • Wide Transformations (Shuffles)

A wide transformation means data from one partition is needed by multiple output partitions, requiring data to be shuffled across the cluster. This makes them more resource-intensive. Examples: .groupBy() and .join().

Wide Transformations (Shuffles)

⚠️ Be aware that wide transformations can significantly slow down your application because of the data shuffle they trigger.
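
Here is a small sketch of both kinds of transformation; explain() is used only to show that the wide one plans a shuffle (an Exchange step), since nothing has executed yet.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 30), ("Charlie", 28)], ["name", "age"])

# Narrow: each input partition feeds exactly one output partition (no shuffle)
adults = df.filter(F.col("age") > 25)

# Wide: rows must be regrouped across partitions (shuffle)
by_age = df.groupBy("age").count()

by_age.explain()  # the physical plan contains an Exchange (shuffle) step
spark.stop()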

Actions #

An action in Spark is a command that tells Spark to execute the computation and return a result. While transformations only define how data should be processed, actions actually trigger Spark to perform the work.

For example, calling .count() on a DataFrame tells Spark to count the number of rows and return the result. Actions are what make Spark start processing data.

💡 The output of a transformation is always another RDD or DataFrame, whereas an action returns a concrete result (a count, a collection of rows, or data written to storage) rather than a new distributed dataset.
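
For instance, in the sketch below the filter is only a declaration; each of the three calls that follow is an action that triggers a job.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-demo").getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])

filtered = df.filter(df["age"] > 25)  # transformation: nothing runs yet

print(filtered.count())               # action: triggers execution, returns 1
print(filtered.collect())             # action: brings the rows back to the driver
filtered.show()                       # action: computes and prints the rows

spark.stop()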

Lazy evaluation #

Lazy evaluation means that Spark delays executing computations until absolutely necessary. Instead of processing data immediately when an operation is specified, Spark first builds a plan of all transformations applied to the source data. This plan is then optimized and converted into a final execution strategy just before running the job.

Why does Spark use lazy evaluation? #

This approach allows Spark to optimize the entire workflow before execution, leading to better performance. For example, Spark can apply predicate pushdown, which means it filters out unnecessary data before loading it, reducing the amount of processing needed. This makes large-scale data processing much more efficient.
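
As an illustration, the sketch below reads a Parquet file (the path and column names are placeholders) and prints the optimized plan before any action runs; with a Parquet source the filter typically appears as a pushed filter in the scan.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# "/tmp/events.parquet" and the column names are placeholders for the example.
df = spark.read.parquet("/tmp/events.parquet")
recent = df.filter(df["year"] == 2024).select("user_id")

recent.explain(True)   # prints the logical and physical plans; no data read yet
print(recent.count())  # only this action actually runs the job

spark.stop()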

Monitor and Troubleshoot #

To track job progress, you can use the Spark web UI, accessible by default on port 4040 of the driver node. This interface provides insights into job status, environment details, and cluster state, and it is particularly valuable for tuning and debugging.

Apache Spark UI
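
If port 4040 is already taken or blocked, the UI port can be changed through configuration; the port below is just an arbitrary example.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ui-demo")
    .config("spark.ui.port", "4050")  # default is 4040; 4050 is an arbitrary example
    .getOrCreate()
)
# While the application runs, the UI is available at http://<driver-host>:4050

spark.stop()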

Example #

This is a simple PySpark script demonstrating how to create and manipulate a DataFrame using Spark.

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("SparkExample") \
    .getOrCreate()

# Sample data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]
columns = ["Name", "Age"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

# Filter data
filtered_df = df.filter(df["Age"] > 25)
filtered_df.show()

# Stop the SparkSession
spark.stop()
Updated on March 6, 2025


