Apache Spark is a powerful distributed computing system used for big data processing, machine learning, and real-time analytics. While it is often deployed on clusters, you can also install it on a Windows machine for development and testing purposes.
In this blog, I’ll walk you through a step-by-step guide on how to install Apache Spark on a Windows system. By the end, you’ll have a fully functional Spark environment on your local machine.
Table of Contents
- Prerequisites
- Step 1: Download and Install Java Development Kit (JDK)
- Step 2: Install Apache Spark
- Step 3: Install Hadoop (Required for WinUtils)
- Step 4: Configure Environment Variables
- Step 5: Verify the Installation
- Common Issues and Troubleshooting
- Conclusion
Prerequisites
Before starting, make sure you have the following:
- Windows 10/11 (64-bit)
- Administrator rights to change environment variables
- Internet connection to download required files
Step 1: Download and Install Java Development Kit (JDK)
Apache Spark requires a Java Runtime Environment (JRE) to run. We will install the JDK to get both the development tools and the runtime.
- Download JDK:
  - Go to Oracle JDK Downloads.
  - Select the appropriate JDK version (preferably JDK 11 or JDK 8; Spark 3.5 also runs on Java 17) and download the Windows x64 Installer.
- Install JDK:
  - Run the installer and follow the instructions.
  - Select the installation path (the default is `C:\Program Files\Java\jdk-XX.X.X`).
  - Complete the installation.
- Set up the JAVA_HOME environment variable:
  - Right-click This PC or My Computer, then click Properties.
  - Click Advanced system settings > Environment Variables.
  - Add a new System Variable with the following:
    - Variable Name: `JAVA_HOME`
    - Variable Value: `C:\Program Files\Java\jdk-XX.X.X` (the path to your JDK installation)
  - Edit the Path variable and add `%JAVA_HOME%\bin` at the end.
- Verify the Java installation:
  - Open Command Prompt and run `java -version`.
  - You should see the version of Java installed. If you prefer working from the terminal, see the sketch below.
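If you would rather skip the System Properties dialog, `setx` can set the same variable machine-wide. A minimal sketch, assuming a JDK 11 install; the folder name `jdk-11.0.21` is only an example, so point it at your actual installation:

```cmd
:: Set JAVA_HOME for the whole machine (requires an elevated Command Prompt).
:: The JDK folder name below is an example; use your installed version.
setx JAVA_HOME "C:\Program Files\Java\jdk-11.0.21" /M

:: setx does not affect the current session, so open a NEW Command Prompt and run:
java -version
```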
Step 2: Install Apache Spark
Now that we have Java installed, let’s move on to installing Apache Spark.
- Download Apache Spark:
  - Visit the Apache Spark official website.
  - Select the latest Spark version (e.g., 3.5.3) and choose Pre-built for Apache Hadoop as the package type.
  - Download the `.tgz` file.
- Extract Apache Spark:
  - Extract the `.tgz` file to `C:\spark` using a tool like 7-Zip (or the `tar` command shown below).
  - The resulting path should look like `C:\spark\spark-3.5.3-bin-hadoop3`.
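Recent Windows 10/11 builds ship with a built-in `tar` command, so you can also unpack the archive without extra tools. A sketch, assuming the file landed in your Downloads folder and you chose the 3.5.3 build used in the examples above:

```cmd
:: Create the target folder and unpack the downloaded archive into it.
:: Adjust the file name if you picked a different Spark version.
mkdir C:\spark
tar -xzf "%USERPROFILE%\Downloads\spark-3.5.3-bin-hadoop3.tgz" -C C:\spark
```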
Step 3: Install Hadoop (Required for WinUtils)
Spark on Windows relies on winutils.exe, a small Hadoop utility compiled for Windows, so you need the matching Hadoop binaries even though you are not running a full Hadoop cluster.
- Download WinUtils:
  - Go to the WinUtils repository on GitHub and download the version that matches Hadoop 3.x (the Hadoop line your Spark package was pre-built for).
- Create Hadoop Directory:
  - Create a folder called `hadoop` in `C:\`.
  - Place winutils.exe into `C:\hadoop\bin`.
- Add Hadoop to Environment Variables:
  - Open System Properties > Advanced > Environment Variables.
  - Add a new system variable:
    - Variable Name: `HADOOP_HOME`
    - Variable Value: `C:\hadoop`
  - Add `%HADOOP_HOME%\bin` to the Path variable.
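The same steps from an elevated Command Prompt might look like this (a sketch; it assumes you have already downloaded winutils.exe):

```cmd
:: Create the folder layout and register HADOOP_HOME machine-wide.
mkdir C:\hadoop\bin
:: ...copy winutils.exe into C:\hadoop\bin, then:
setx HADOOP_HOME "C:\hadoop" /M

:: In a NEW Command Prompt, running winutils with no arguments
:: should print its usage text, confirming the binary works.
C:\hadoop\bin\winutils.exe
```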
Step 4: Configure Environment Variables
To run Spark smoothly on Windows, we need to configure environment variables for Spark and Hadoop.
- Set SPARK_HOME:
  - Open Environment Variables.
  - Add a new system variable:
    - Variable Name: `SPARK_HOME`
    - Variable Value: `C:\spark\spark-3.5.3-bin-hadoop3`
- Add SPARK_HOME to the Path:
  - Edit the Path variable.
  - Add `%SPARK_HOME%\bin` and `%SPARK_HOME%\sbin` to the end of the Path variable.
- Set HADOOP_HOME (if not set earlier):
  - Add a new system variable:
    - Variable Name: `HADOOP_HOME`
    - Variable Value: `C:\hadoop`
- Set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON (optional):
  - If you want to run Spark with Python, install Python and point Spark to it.
  - Add two new system variables:
    - Variable Name: `PYSPARK_PYTHON`, Variable Value: `C:\Path\To\Python\python.exe`
    - Variable Name: `PYSPARK_DRIVER_PYTHON`, Variable Value: `C:\Path\To\Python\python.exe`
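All of these can also be set with `setx`. A sketch using the example paths from above (run from an elevated Command Prompt, and swap in your real Python path):

```cmd
:: Register the Spark-related variables machine-wide.
setx SPARK_HOME "C:\spark\spark-3.5.3-bin-hadoop3" /M
setx HADOOP_HOME "C:\hadoop" /M
setx PYSPARK_PYTHON "C:\Path\To\Python\python.exe" /M
setx PYSPARK_DRIVER_PYTHON "C:\Path\To\Python\python.exe" /M

:: Add %SPARK_HOME%\bin and %HADOOP_HOME%\bin to Path via the Environment
:: Variables dialog instead; setx can silently truncate a long Path value.
```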
Step 5: Verify the Installation
Now, it’s time to test the installation and make sure Spark is working as expected.
- Start the Spark shell (Scala):
  - Open Command Prompt and run `spark-shell`.
- Start PySpark (for Python):
  - Open Command Prompt and run `pyspark`.
- Check the Spark context:
  - When the Spark shell starts, it creates a `SparkContext` object called `sc`.
  - Run `sc.version` to print the Spark version.
- Run a Sample Program:
  - Run the following in the Scala shell to count the number of elements in an RDD:

    ```scala
    val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
    data.count() // should return 5
    ```
- Exit Spark:
  - To exit the Scala Spark shell, type `:quit` (in PySpark, use `exit()`).
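For a quick end-to-end smoke test without an interactive shell, you can also run one of the examples that ships with the Spark distribution. A sketch, assuming the paths configured above:

```cmd
:: Confirm the Spark binaries resolve from Path and print the version banner.
spark-submit --version

:: Run a bundled example job that approximates Pi; look for
:: "Pi is roughly ..." near the end of the output.
%SPARK_HOME%\bin\run-example SparkPi 10
```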
Common Issues and Troubleshooting
Here are some common issues and how to resolve them:
- Java not found:
  - Make sure `JAVA_HOME` is set correctly.
  - Run `java -version` to confirm Java is installed and configured.
- winutils.exe not found:
  - Make sure winutils.exe is in `C:\hadoop\bin`.
  - Ensure `%HADOOP_HOME%\bin` is in the Path variable.
- spark-shell not recognized:
  - Check that `%SPARK_HOME%\bin` is in the Path variable.
  - Open a new Command Prompt after changing environment variables; windows that were already open will not pick up the changes.
- PySpark not launching:
  - Ensure `PYSPARK_PYTHON` points to your Python installation's python.exe.
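When in doubt, dump the values your shell actually sees; a wrong or missing entry usually points straight at the culprit. A quick sketch:

```cmd
:: Print the variables Spark depends on. Each should echo a real path;
:: an unset variable echoes its literal name (e.g. %JAVA_HOME%) instead.
echo %JAVA_HOME%
echo %HADOOP_HOME%
echo %SPARK_HOME%

:: Confirm the key executables resolve from Path.
where winutils.exe
where spark-shell
```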
Conclusion
Congratulations! 🎉 You have successfully installed Apache Spark on Windows. You can now experiment with Spark using both Scala and Python (PySpark).
This setup is perfect for local development, learning, and testing Spark-based projects. With your local Spark environment ready, you can now create RDDs, DataFrames, and run machine learning models right from your Windows machine.
Next Steps
- Try building a simple ETL pipeline using PySpark.
- Experiment with DataFrame transformations and actions.
- Connect Spark to Jupyter Notebook for an interactive coding experience (one way to wire this up is sketched below).
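On the Jupyter point: a common approach is to let the PySpark launcher start the notebook server by overriding the driver Python for a single session. A sketch, assuming Jupyter Notebook is already installed in the Python environment Spark points to:

```cmd
:: For this Command Prompt session only, have PySpark launch Jupyter as its driver.
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark

:: A browser tab opens; notebooks started there run PySpark's startup script,
:: so `spark` and `sc` should already be defined in new notebooks.
```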
If you have any questions or issues with the installation, drop a comment below, and I’ll be happy to help. 🚀
Happy Coding!