Apache Spark is one of the most widely used distributed engines for processing large amounts of data. Spark can run on several cluster managers: Spark Standalone, Apache Hadoop YARN, Apache Mesos, and Kubernetes.
In this hands-on article, we’ll show how to use spark-k8s-operator to run Spark on Kubernetes.
Prerequisites
Before you start, make sure that the following tools are installed on your local system and that you have a fundamental understanding of Kubernetes and Spark.
- A local Kubernetes cluster, for example the one bundled with Docker Desktop or a minikube cluster
- kubectl to manage Kubernetes resources
- Helm to deploy resources based on Helm charts
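To quickly confirm the tooling is in place, you can run the checks below; they only assume that your kubeconfig points at the local cluster.
# Verify the client tools are installed
kubectl version --client
helm version
# Verify the local cluster is reachable
kubectl get nodes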
Deploy spark-operator on the cluster
To get started, add the spark-operator chart repository to Helm by running the following command.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
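After adding the repository, it is a good idea to refresh the local chart index so Helm picks up the latest chart version:
helm repo update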
Now, run the command below to install the spark-operator Helm chart.
helm install spark-on-k8s spark-operator/spark-operator --namespace spark-operator --create-namespace
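Before moving on, you can confirm that the operator was deployed successfully; the exact pod name will vary because Helm derives it from the release name.
# Check the release status
helm status spark-on-k8s --namespace spark-operator
# The operator pod should reach the Running state
kubectl get pods --namespace spark-operator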
Create service account
In this step, we will create a service account named spark, which the Spark driver pods will use to connect to the Kubernetes API server. Here’s how to set it up:
- Create a file named rbac.yaml with the manifest below, which defines the spark service account, a role with the permissions Spark needs, and a role binding between the two:
# Create spark service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
# Create spark-role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "get", "watch", "list"]
---
# Bind spark service account to spark-role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
- Deploy the service account using the following command:
kubectl apply -f rbac.yaml
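To double-check that the RBAC objects were created as expected, you can list them in the default namespace:
kubectl get serviceaccount spark
kubectl get role spark-role
kubectl get rolebinding spark-role-binding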
Deploy the spark-pi application
In the example below, we will use the Spark Operator to run the Spark Pi example application on Kubernetes. Create a file named spark-pi.yaml with the manifest below.
apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: pyspark-pi namespace: default spec: type: Python pythonVersion: "3" mode: cluster image: "gcr.io/spark-operator/spark-py:v3.1.1" imagePullPolicy: Always mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py sparkVersion: "3.1.1" restartPolicy: type: OnFailure onFailureRetries: 3 onFailureRetryInterval: 10 onSubmissionFailureRetries: 5 onSubmissionFailureRetryInterval: 20 driver: cores: 1 coreLimit: "1200m" memory: "512m" labels: version: 3.1.1 serviceAccount: spark executor: cores: 1 instances: 1 memory: "512m" labels: version: 3.1.1
Submit the application
To submit the Spark application, apply the manifest file created above:
kubectl apply -f spark-pi.yaml
To verify that your Spark application has been successfully submitted, run:
kubectl get sparkapplications
Within a few moments, you should see an application named spark-pi with its status reported as RUNNING:
NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
spark-pi   RUNNING   1          2023-09-25T15:41:14Z   <no value>   91s
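If you want to follow the run more closely, you can inspect the pods behind the application. The label selector and driver pod name below reflect the operator's usual defaults (the driver pod is typically named after the application, spark-pi-driver here); adjust them if your setup differs.
# List the driver and executor pods created for the application
kubectl get pods -l sparkoperator.k8s.io/app-name=spark-pi
# Stream the driver logs; the Pi estimate is printed near the end of the run
kubectl logs -f spark-pi-driver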
Monitoring
After deploying the application, a service named spark-pi-ui-svc is created, hosting the Spark UI. Access it by forwarding the service’s port to your local environment:
kubectl port-forward svc/spark-pi-ui-svc 4040
Now, you can access the Spark UI in your browser at http://localhost:4040 and monitor the application’s jobs, stages, and executors.
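Besides the UI, you can also inspect the application’s current state and lifecycle events directly from the custom resource:
kubectl describe sparkapplication spark-pi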

Cleanup
If you need to uninstall the spark-operator Helm release, use the following command:
helm uninstall spark-on-k8s --namespace spark-operator
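To remove the demo resources created earlier in this article, delete the Spark application and the RBAC objects as well:
kubectl delete -f spark-pi.yaml
kubectl delete -f rbac.yaml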
Conclusion
With these steps, you’ve successfully set up and deployed Apache Spark on a Kubernetes cluster using the spark-k8s-operator. This powerful combination empowers you to run Spark applications efficiently and manage them seamlessly. Explore the capabilities and possibilities of Spark on Kubernetes for your data processing needs.