Apache Spark is one of the most widely used distributed engines for processing large amounts of data. Spark can run on several cluster managers: Spark Standalone, Apache Hadoop YARN, Apache Mesos, and Kubernetes.
In this hands-on article, we’ll show how to use spark-k8s-operator to run Spark on Kubernetes.
Prerequisites
Before you start, make sure that the following tools are installed on your local system and that you have a fundamental understanding of Kubernetes and Spark.
- A local Kubernetes cluster, for example the one bundled with Docker Desktop or a minikube cluster
- kubectl to manage Kubernetes resources
- Helm to deploy resources based on Helm charts
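To quickly confirm the tooling is in place, you can run the checks below; they only assume that your kubeconfig points at the local cluster.
# Verify the client tools are installed
kubectl version --client
helm version
# Verify the local cluster is reachable
kubectl get nodes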
Deploy spark-operator on the cluster
To get started, add the spark-operator chart repository to Helm by running the following command.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
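After adding the repository, it is a good idea to refresh the local chart index so Helm picks up the latest chart version:
helm repo update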
Now, run the command below to install the spark-operator Helm chart.
helm install spark-on-k8s spark-operator/spark-operator --namespace spark-operator --create-namespace
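Before moving on, you can confirm that the operator was deployed successfully; the exact pod name will vary because Helm derives it from the release name.
# Check the release status
helm status spark-on-k8s --namespace spark-operator
# The operator pod should reach the Running state
kubectl get pods --namespace spark-operator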
Create service account
In this step, we will create a service account named spark, which the Spark driver pods will use to connect to the Kubernetes API server. Here’s how to set it up:
- Create a file named rbac.yaml with the manifest below, which defines the spark service account, a role with the permissions Spark needs, and a role binding between the two:
# Create spark service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
# Create spark-role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "get", "watch", "list"]
---
# Bind spark service account to spark-role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
- Deploy the service account using the following command:
kubectl apply -f rbac.yaml
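To double-check that the RBAC objects were created as expected, you can list them in the default namespace:
kubectl get serviceaccount spark
kubectl get role spark-role
kubectl get rolebinding spark-role-binding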
Deploy the spark-pi application
In the example below, we will use the Spark Operator to run the Spark Pi example application on Kubernetes. Create a file named spark-pi.yaml with the manifest below.
apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: pyspark-pi namespace: default spec: type: Python pythonVersion: "3" mode: cluster image: "gcr.io/spark-operator/spark-py:v3.1.1" imagePullPolicy: Always mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py sparkVersion: "3.1.1" restartPolicy: type: OnFailure onFailureRetries: 3 onFailureRetryInterval: 10 onSubmissionFailureRetries: 5 onSubmissionFailureRetryInterval: 20 driver: cores: 1 coreLimit: "1200m" memory: "512m" labels: version: 3.1.1 serviceAccount: spark executor: cores: 1 instances: 1 memory: "512m" labels: version: 3.1.1
Submit the application
To submit the Spark application, apply the manifest file created above:
kubectl apply -f spark-pi.yaml
To verify that your Spark application has been successfully submitted, run:
kubectl get sparkapplications
Within a few moments, you should see an application named spark-pi with its status reported as RUNNING:
NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
spark-pi   RUNNING   1          2023-09-25T15:41:14Z   <no value>   91s
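If you want to follow the run more closely, you can inspect the pods behind the application. The label selector and driver pod name below reflect the operator's usual defaults (the driver pod is typically named after the application, spark-pi-driver here); adjust them if your setup differs.
# List the driver and executor pods created for the application
kubectl get pods -l sparkoperator.k8s.io/app-name=spark-pi
# Stream the driver logs; the Pi estimate is printed near the end of the run
kubectl logs -f spark-pi-driver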
Monitoring
After deploying the application, a service named spark-pi-ui-svc is created, hosting the Spark UI. Access it by forwarding the service’s port to your local environment:
kubectl port-forward svc/spark-pi-ui-svc 4040
Now, you can access the Spark UI in your browser at http://localhost:4040 and monitor the application’s jobs, stages, and executors.
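Besides the UI, you can also inspect the application’s current state and lifecycle events directly from the custom resource:
kubectl describe sparkapplication spark-pi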

Cleanup
If you need to uninstall the spark-operator Helm release, use the following command:
helm uninstall spark-on-k8s --namespace spark-operator
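To remove the demo resources created earlier in this article, delete the Spark application and the RBAC objects as well:
kubectl delete -f spark-pi.yaml
kubectl delete -f rbac.yaml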
Conclusion
With these steps, you’ve successfully set up and deployed Apache Spark on a Kubernetes cluster using the spark-k8s-operator. This powerful combination empowers you to run Spark applications efficiently and manage them seamlessly. Explore the capabilities and possibilities of Spark on Kubernetes for your data processing needs.