[03] – Deploy Apache Spark with Kubernetes (K8s)

kerrache.massipssa

Apache Spark is one of the most widely used distributed engines for processing large amounts of data. It can run on several cluster managers: Spark Standalone, Apache Hadoop YARN, Apache Mesos, and Kubernetes.

In this hands-on article, we’ll show how to use the spark-k8s-operator to run Spark on Kubernetes.

Prerequisites #

Before you start, make sure the following tools are installed on your local system and that you have a basic understanding of Kubernetes and Spark; a quick sanity check follows the list.

  • A local Kubernetes cluster (e.g., via Docker Desktop or Minikube)
  • kubectl to manage Kubernetes resources
  • Helm to deploy resources based on Helm charts
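
As a quick sanity check, you can confirm that each tool is available and that the local cluster is reachable. This is a minimal sketch, assuming kubectl and Helm are already on your PATH:

# Check that kubectl can reach the local cluster
kubectl cluster-info

# Check the installed client versions
kubectl version --client
helm version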

Deploy spark-operator on the cluster #

To get started, add the spark-operator repository to Helm by running the following command.

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator

Now, run the command below to install the spark-operator Helm chart.

helm install spark-on-k8s spark-operator/spark-operator --namespace spark-operator --create-namespace
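
Before moving on, it’s worth confirming that the operator pod is running. The exact pod name is generated by the chart, but it should include the release name spark-on-k8s:

kubectl get pods --namespace spark-operator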

Create service account #

In this step, we will create a service account named spark, which the spark-operator will use to connect to the Kubernetes API server. Here’s how to set it up:

  • Create an rbac.yaml file and insert the following content into it.
# Create spark service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
# Create spark-role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods", "configmap"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "get", "watch", "list"]
---
# Bind spark service account to spark-role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
  • Apply the manifest to create the service account, role, and role binding:
kubectl apply -f rbac.yaml
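
As an optional check, you can verify that the three RBAC objects now exist in the default namespace:

kubectl get serviceaccount spark
kubectl get role spark-role
kubectl get rolebinding spark-role-binding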

Deploy the spark-pi application #

In the example below, we will use the operator we just installed to run the Spark Pi example application on Kubernetes. Save the following manifest as spark-pi.yaml.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "gcr.io/spark-operator/spark-py:v3.1.1"
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "3.1.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.1

Submit the application #

To submit the Spark application, apply the manifest file created above:

kubectl apply -f spark-pi.yaml

To verify that your Spark application has been successfully submitted, run:

kubectl get sparkapplications

Within a few moments, you should see an application named spark-pi with its status reported as RUNNING.

NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
spark-pi   RUNNING   1          2023-09-25T15:41:14Z   <no value>   91s
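
Once the run completes, the computed value of Pi appears in the driver logs. As an assumption worth noting, the operator conventionally names the driver pod <application-name>-driver, so for this manifest it should be spark-pi-driver:

# Read the result printed by pi.py from the driver pod
kubectl logs spark-pi-driver | grep -i "pi is roughly"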

Monitoring #

After deploying the application, a service named spark-pi-ui-svc is created, hosting the Spark UI. Access it by forwarding the service’s port to your local machine:

kubectl port-forward svc/spark-pi-ui-svc 4040

Now, you can access the Spark UI in your browser at http://localhost:4040, where you should see a screen like the one below.

[Image: Spark UI]
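
Beyond the UI, the operator records the application’s lifecycle in the SparkApplication resource itself. Describing it is a standard way to surface the status fields and recent events without the UI:

kubectl describe sparkapplication spark-pi --namespace default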

Cleanup #

If you need to uninstall the spark-operator Helm release, use the following command. Note that the release was installed earlier under the name spark-on-k8s in the spark-operator namespace:

helm uninstall spark-on-k8s --namespace spark-operator
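
To finish the cleanup, you can also delete the resources created from the two manifests. This assumes you kept the file names used earlier (rbac.yaml and spark-pi.yaml):

kubectl delete -f spark-pi.yaml
kubectl delete -f rbac.yaml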

Conclusion #

With these steps, you’ve set up the spark-k8s-operator and deployed an Apache Spark application on a Kubernetes cluster. This combination lets you run Spark workloads efficiently and manage them with the same tooling as the rest of your cluster. Explore the capabilities of Spark on Kubernetes for your data processing needs.

Updated on March 5, 2025
