Apache Spark is one of the most widely used distributed engines for processing large amounts of data. It can run on several cluster managers: Spark Standalone, Apache Hadoop YARN, Apache Mesos, and Kubernetes.
In this hands-on article, we’ll show how to use the spark-on-k8s-operator to run Spark on Kubernetes.
Prerequisites #
Before you start, make sure that the following tools are installed on your local system and that you have a fundamental understanding of Kubernetes and Spark.
- A local Kubernetes cluster, e.g. the one bundled with Docker Desktop or one created with minikube
- kubectl to manage Kubernetes resources
- Helm to deploy resources based on Helm charts
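Before proceeding, you can confirm that the client tools are available by checking their versions (the exact output depends on your installation):
kubectl version --client
helm version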
Deploy spark-operator on the cluster #
To get started, add the spark-operator chart repository to Helm by running the following command.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
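If you have added this repository before, refresh your local chart index so Helm picks up the latest version of the operator:
helm repo update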
Now, run the command below to install the spark-operator Helm chart.
helm install spark-on-k8s spark-operator/spark-operator --namespace spark-operator --create-namespace
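To confirm that the operator was deployed successfully, list the pods in the spark-operator namespace; the operator pod should reach the Running state:
kubectl get pods --namespace spark-operator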
Create service account #
In this step, we will create a service account named spark, which the Spark driver pod will use to talk to the Kubernetes API server (for example, to create executor pods). Here’s how to set it up:
- Create an rbac.yaml file and insert the following content into it.
# Create spark service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
# Create spark-role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "get", "watch", "list"]
---
# Bind spark service account to spark-role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
- Deploy the service account using the following command:
kubectl apply -f rbac.yaml
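You can check that all three resources were created; the names below match the manifest above:
kubectl get serviceaccount spark --namespace default
kubectl get role spark-role --namespace default
kubectl get rolebinding spark-role-binding --namespace default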
Deploy the spark-pi application #
Now that the operator is installed, we can test it by running the Spark Pi example application on Kubernetes. Create a file named spark-pi.yaml with the manifest below.
apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: pyspark-pi namespace: default spec: type: Python pythonVersion: "3" mode: cluster image: "gcr.io/spark-operator/spark-py:v3.1.1" imagePullPolicy: Always mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py sparkVersion: "3.1.1" restartPolicy: type: OnFailure onFailureRetries: 3 onFailureRetryInterval: 10 onSubmissionFailureRetries: 5 onSubmissionFailureRetryInterval: 20 driver: cores: 1 coreLimit: "1200m" memory: "512m" labels: version: 3.1.1 serviceAccount: spark executor: cores: 1 instances: 1 memory: "512m" labels: version: 3.1.1
Submit the application #
To submit the Spark application, apply the manifest file created above as follows:
kubectl apply -f spark-pi.yaml
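Once the application is submitted, the operator creates a driver pod and, since the manifest requests a single executor instance, one executor pod. By default, the driver pod is named after the application with a -driver suffix (spark-pi-driver here). You can watch the pods come up with:
kubectl get pods --watch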
To verify that your Spark application has been successfully submitted, run:
kubectl get sparkapplications
Within a few moments, you should see an application named spark-pi with its status shown as RUNNING (and eventually COMPLETED).
NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
spark-pi   RUNNING   1          2023-09-25T15:41:14Z   <no value>   91s
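Once the status changes to COMPLETED, the result of the computation is available in the driver pod’s logs; for the Pi example, you should see a line similar to “Pi is roughly 3.14...”:
kubectl logs spark-pi-driver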
Monitoring #
After the application is deployed, the operator creates a service named spark-pi-ui-svc that exposes the Spark UI. Access it by forwarding the service’s port to your local machine:
kubectl port-forward svc/spark-pi-ui-svc 4040
Now, you can access the Spark UI in your browser at http://localhost:4040.
Cleanup #
If you need to uninstall the spark-operator Helm release, use the following command (the release name and namespace match the ones used during installation):
helm uninstall spark-on-k8s --namespace spark-operator
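To also remove the example application and the RBAC resources created earlier, delete them with the same manifest files:
kubectl delete -f spark-pi.yaml
kubectl delete -f rbac.yaml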
Conclusion #
With these steps, you’ve set up Apache Spark on a Kubernetes cluster using the spark-on-k8s-operator, submitted a sample application, and monitored it through the Spark UI. Explore the capabilities of Spark on Kubernetes for your own data processing needs.