How to Install Apache Spark on Windows

Apache Spark is a powerful distributed computing system used for big data processing, machine learning, and real-time analytics. While it is often deployed on clusters, you can also install it on a Windows machine for development and testing purposes.

In this blog, I’ll walk you through a step-by-step guide on how to install Apache Spark on a Windows system. By the end, you’ll have a fully functional Spark environment on your local machine.


Table of Contents

  1. Prerequisites
  2. Step 1: Download and Install Java Development Kit (JDK)
  3. Step 2: Install Apache Spark
  4. Step 3: Install Hadoop (Required for WinUtils)
  5. Step 4: Configure Environment Variables
  6. Step 5: Verify the Installation
  7. Common Issues and Troubleshooting
  8. Conclusion

Prerequisites

Before starting, make sure you have the following:

  • Windows 10/11 (64-bit)
  • Administrator rights to change environment variables
  • Internet connection to download required files


Step 1: Download and Install Java Development Kit (JDK)

Apache Spark requires a Java Runtime Environment (JRE) to run. We will install the JDK to get both the development tools and the runtime.

  1. Download JDK:
    • Go to Oracle JDK Downloads.
    • Select a JDK version supported by Spark (JDK 8, 11, or 17 for Spark 3.5.x) and download the Windows x64 Installer.
  2. Install JDK:
    • Run the installer and follow the instructions.
    • Select the installation path (default is C:\Program Files\Java\jdk-XX.X.X).
    • Complete the installation.
  3. Set up the JAVA_HOME environment variable:
    • Right-click This PC or My Computer, then click Properties.
    • Click Advanced system settings > Environment Variables.
    • Add a new System Variable with the following:
      • Variable Name: JAVA_HOME
      • Variable Value: C:\Program Files\Java\jdk-XX.X.X (path to your JDK installation)
    • Edit the Path variable and add %JAVA_HOME%\bin at the end of the path.
  4. Verify the Java installation:
    • Open Command Prompt and run:
      java -version
    • You should see the version of Java installed.
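
If you prefer the command line to the System Properties dialog, steps 3 and 4 can also be done as sketched below (replace jdk-XX.X.X with your actual JDK folder; setx only affects newly opened Command Prompt windows):

      REM Set JAVA_HOME for your user account (use your actual JDK folder)
      setx JAVA_HOME "C:\Program Files\Java\jdk-XX.X.X"

      REM Open a NEW Command Prompt, then confirm the variable and the Java version
      echo %JAVA_HOME%
      java -version

Editing the Path variable itself is still safest through the Environment Variables dialog.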

Step 2: Install Apache Spark

Now that we have Java installed, let’s move on to installing Apache Spark.

  1. Download Apache Spark:
    • Visit the Apache Spark official website.
    • Select the latest Spark version (e.g., 3.5.3) and choose Pre-built for Apache Hadoop as the package type.
    • Download the .tgz file.
  2. Extract Apache Spark:
    • Extract the .tgz file (for example, with 7-Zip) to C:\spark.
    • The resulting path should look like C:\spark\spark-3.5.3-bin-hadoop3.
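
If you would rather extract from the command line, recent Windows 10/11 builds ship with a built-in tar command. Run the following from the folder where the .tgz was downloaded (the file name assumes the 3.5.3 package; adjust it to the version you picked):

      REM Create the target folder and unpack the Spark archive into it
      mkdir C:\spark
      tar -xzf spark-3.5.3-bin-hadoop3.tgz -C C:\spark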

Step 3: Install Hadoop (Required for WinUtils)

Since Spark on Windows needs access to WinUtils (a Hadoop utility), you must have Hadoop binaries for Windows.

  1. Download WinUtils:
    • Download a winutils.exe build that matches the Hadoop version of your Spark package (Hadoop 3.x for spark-3.5.3-bin-hadoop3); community-maintained builds are available on GitHub.
  2. Create Hadoop Directory:
    • Create a folder called hadoop in C:\.
    • Place winutils.exe in C:\hadoop\bin.
  3. Add Hadoop to Environment Variables:
    • Open System Properties > Advanced > Environment Variables.
    • Add a new system variable:
      • Variable Name: HADOOP_HOME
      • Variable Value: C:\hadoop
    • Add %HADOOP_HOME%\bin to the Path variable.
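
The same setup can be scripted from a Command Prompt; this sketch assumes winutils.exe is sitting in your Downloads folder (adjust the source path if it is elsewhere):

      REM Create the Hadoop bin folder and copy winutils.exe into it
      mkdir C:\hadoop\bin
      copy "%USERPROFILE%\Downloads\winutils.exe" C:\hadoop\bin\

      REM Point HADOOP_HOME at C:\hadoop (takes effect in new Command Prompt windows)
      setx HADOOP_HOME "C:\hadoop"

As before, adding %HADOOP_HOME%\bin to the Path variable is easiest through the Environment Variables dialog.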


Step 4: Configure Environment Variables

To run Spark smoothly on Windows, we need to configure environment variables for Spark and Hadoop.

  1. Set SPARK_HOME:
    • Open Environment Variables.
    • Add a new system variable:
      • Variable Name: SPARK_HOME
      • Variable Value: C:\spark\spark-3.5.3-bin-hadoop3
  2. Add SPARK_HOME to the Path:
    • Edit the Path variable.
    • Add %SPARK_HOME%\bin and %SPARK_HOME%\sbin to the end of the Path variable.
  3. Set HADOOP_HOME (if not set earlier):
    • Add a new system variable:
      • Variable Name: HADOOP_HOME
      • Variable Value: C:\hadoop
  4. Set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON (Optional):
    • If you want to run Spark with Python, you need to install Python and point Spark to it.
    • Add two new system variables:
      • Variable Name: PYSPARK_PYTHON
      • Variable Value: C:\Path\To\Python\python.exe
      • Variable Name: PYSPARK_DRIVER_PYTHON
      • Variable Value: C:\Path\To\Python\python.exe
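
Before moving on to Step 5, it is worth opening a new Command Prompt and confirming that each variable resolves to a real path rather than echoing its own name:

      echo %JAVA_HOME%
      echo %SPARK_HOME%
      echo %HADOOP_HOME%
      REM This one only resolves if you completed the optional PySpark step above
      echo %PYSPARK_PYTHON%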


Step 5: Verify the Installation

Now, it’s time to test the installation and make sure Spark is working as expected.

  1. Start the Spark shell (Scala):
    • Open Command Prompt and type:
      spark-shell
  2. Start PySpark (for Python):
    • In a separate Command Prompt, type:
      pyspark
    • A PySpark version of the sample program is shown after this list.
  3. Check the Spark Context:
    • When Spark starts, it creates a SparkContext object called sc (and a SparkSession called spark).
    • You can check the version by typing:
      sc.version
  4. Run a Sample Program:
    • Run the following in the Spark shell to count the elements of an RDD:
      val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
      data.count()
  5. Exit Spark:
    • To exit the Spark shell, type :quit (in PySpark, use exit()).
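
If you started pyspark instead of spark-shell, the same sanity check looks like this in Python (sc and spark are already created for you when the shell starts):

      # Count the elements of a small RDD, mirroring the Scala example above
      data = sc.parallelize([1, 2, 3, 4, 5])
      data.count()   # should return 5

      # The Spark version is also exposed on the SparkSession object
      spark.version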


Common Issues and Troubleshooting

Here are some common issues and how to resolve them:

  1. Java not found:
    • Make sure JAVA_HOME is set correctly.
    • Run java -version to confirm Java is installed and configured.
  2. winutils.exe not found:
    • Make sure winutils.exe is in C:\hadoop\bin.
    • Ensure %HADOOP_HOME%\bin is in the Path variable.
  3. spark-shell not recognized:
    • Check if %SPARK_HOME%\bin is in the Path variable.
    • Restart the system if environment variables were changed.
  4. PySpark not launching:
    • Ensure PYSPARK_PYTHON is set to the Python installation path.


Conclusion

Congratulations! 🎉 You have successfully installed Apache Spark on Windows. You can now experiment with Spark using both Scala and Python (PySpark).

This setup is perfect for local development, learning, and testing Spark-based projects. With your local Spark environment ready, you can now create RDDs, DataFrames, and run machine learning models right from your Windows machine.

Next Steps

  • Try building a simple ETL pipeline using PySpark.
  • Experiment with DataFrame transformations and actions.
  • Connect Spark to Jupyter Notebook for an interactive coding experience.
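
As a starting point for the DataFrame experiments, here is a minimal standalone PySpark sketch (it assumes Python is installed and PYSPARK_PYTHON is set as in Step 4; the sample data and names are made up for illustration). Save it as a .py file and run it with spark-submit:

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      # Create a local SparkSession; "local[*]" uses all CPU cores on your machine
      spark = SparkSession.builder.appName("local-demo").master("local[*]").getOrCreate()

      # A tiny in-memory DataFrame (hypothetical sample data)
      df = spark.createDataFrame(
          [("alice", 34), ("bob", 45), ("carol", 29)],
          ["name", "age"],
      )

      # A simple transformation chain and an action: filter, derive a column, show the result
      df.filter(F.col("age") > 30).withColumn("age_in_5_years", F.col("age") + 5).show()

      spark.stop()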

If you have any questions or issues with the installation, drop a comment below, and I’ll be happy to help. 🚀

Happy Coding!
