
[02] – Installing Apache Spark Locally

In this tutorial, you will learn how to install Apache Spark / PySpark 3.5.4 on your local machine using the Apache Spark binary. The guide provides step-by-step instructions for Windows, macOS, and Linux.

Additionally, if you’re interested in deploying Apache Spark on Kubernetes, check out this tutorial: Deploy Apache Spark with Kubernetes (K8s).

Prerequisites #

Before installing Apache Spark, ensure you meet the following requirements:

  • Java 8 or later is required for Spark to run.
  • If you plan to write Spark jobs in Scala, you need to install Scala.
  • If you plan to write Spark jobs in Python, you need to install Python.
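You can quickly confirm what is already installed from a terminal. The exact commands may differ slightly on your system (for example, python instead of python3), and scala -version is only needed if you intend to write Scala jobs:
java -version
scala -version
python3 --version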

Install Apache Spark #

Step 1: Download Spark #

  1. Visit the official Apache Spark website.
  2. Select:
  • Spark version (latest stable)
  • Hadoop version (choose pre-built for Hadoop 3.x if unsure)
  3. Download the .tgz file.
  4. Extract the archive:
tar -xvzf spark-*.tgz
mv spark-* ~/spark
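After extraction, ~/spark should contain a bin directory with the spark-shell, pyspark, and spark-submit launchers. You can list it to confirm (assuming you moved the folder to ~/spark as shown above):
ls ~/spark/bin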

Step 2: Set Environment Variables #

Now, add the Spark bin directory to your environment variables so that the spark-shell, pyspark, and spark-submit commands are available from any terminal.

Windows (PowerShell) #

  1. Open PowerShell.
  2. Set environment variables:
$env:SPARK_HOME="C:\path\to\spark"
$env:PATH+=";$env:SPARK_HOME\bin"

💡 SPARK PATH

Replace C:\path\to\spark with the actual location where you extracted the Spark binary on your system. For example, if you extracted Spark to C:\Spark, update the path accordingly.
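Note that variables set with $env: only last for the current PowerShell session. To make SPARK_HOME permanent, one option is setx, which writes it to your user-level environment (a sketch; adjust the path, and open a new terminal afterwards for it to take effect). For PATH, it is safer to append the bin folder through System Properties > Environment Variables than to overwrite it with setx.
setx SPARK_HOME "C:\path\to\spark"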

Linux / macOS #

  1. Open the terminal and edit ~/.bashrc or ~/.zshrc:
nano ~/.bashrc
  2. Add:
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
  3. Apply changes:
source ~/.bashrc
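To confirm the variables were applied, print SPARK_HOME and ask Spark to report its version:
echo $SPARK_HOME
spark-submit --version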

Install PySpark (for Python users) #

pip install pyspark
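If you want the pip package to match the binary you downloaded, you can pin the version (for example, pip install pyspark==3.5.4). Afterwards, a quick import check confirms which version Python sees:
python -c "import pyspark; print(pyspark.__version__)"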

Verify Installation #

To confirm that Apache Spark was installed correctly, you need to check if both spark-shell (for Scala) and pyspark (for Python) run without errors.

  • Check Spark Shell

Run the spark-shell command to check the Scala Spark shell. If the installation is correct, it prints the Spark banner with the version number and leaves you at a scala> prompt.

  • Check PySpark

Run the pyspark command to check PySpark (the Python API). It should print the Spark banner and start a Python shell with a SparkSession already available as spark.

  • Check Spark UI

While spark-shell or pyspark is running, open your preferred browser and navigate to http://localhost:4040. This should display the Spark UI, which provides insight into running jobs, stages, and tasks.
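As an additional end-to-end check, you can run one of the example applications bundled with the binary distribution. SparkPi computes an approximation of π, prints the result, and exits; the run-example launcher lives in the same bin directory as spark-shell:
run-example SparkPi 10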

Conclusion #

By following this guide, you have successfully installed and configured Apache Spark locally on your machine. Whether you’re using Windows, macOS, or Linux, you now have a functional Spark environment ready for big data processing and analytics.

To deepen your knowledge, explore Spark’s core features, including RDDs, DataFrames, and SQL operations. If you plan to run Spark in a distributed environment, consider learning about YARN, Kubernetes, and cloud deployments.
