As a full-stack developer and Linux expert, I often need to work with large datasets for machine learning, data analysis, and processing big data workloads. Apache Spark has emerged as one of the most popular frameworks for these use cases due to its speed, ease of use, and unified engine for both batch and stream data processing.
In this comprehensive guide, I will walk you through the entire process of installing Apache Spark 3.3.1 on Ubuntu 20.04 from scratch. I will cover the prerequisites, download and setup, environment configuration, starting the Spark cluster, using the Spark shell and PySpark, and more.
Prerequisites
Before we get started with the Spark installation, we need to set up the following prerequisites:
- Ubuntu 20.04 server with at least 4 GB of RAM
- A non-root user account with sudo privileges
- Java 8 or 11 installed
I will use the non-root user "sparkuser" in this guide. Make sure you log in to your Ubuntu server as this user to follow along.
Let's start by updating the package index and installing Java 11:
sudo apt update
sudo apt install openjdk-11-jdk
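Before moving on, it is worth confirming that a java binary actually ended up on the PATH, since Spark will not start without a JDK. A minimal check that works in any POSIX shell:

```shell
# Check whether a java binary is reachable on the PATH
if command -v java >/dev/null 2>&1; then
  echo "java found"
else
  echo "java missing"
fi
```

If it prints "java missing", re-run the apt install step above before continuing. You can also run java -version to confirm the installed version is 8 or 11.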
Downloading Apache Spark
The first step is to download the latest release of Apache Spark from the official website. As of this writing, the latest stable version is Spark 3.3.1, prebuilt for Hadoop 3.
cd ~/Downloads
wget https://downloads.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
This will download the Spark 3.3.1 binary package into the ~/Downloads directory.
Next, we extract the downloaded .tgz archive:
tar xzf spark-3.3.1-bin-hadoop3.tgz
And move the extracted folder to the /opt/ directory:
sudo mv spark-3.3.1-bin-hadoop3 /opt/spark
Configuring Environment Variables
We need to configure some environment variables for Spark to work properly.
Open the .bashrc file using nano or your preferred text editor:
nano ~/.bashrc
And add the following lines at the end:
# Apache Spark Environment Variables
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save the file, exit the editor, and reload .bashrc:
source ~/.bashrc
This sets the $SPARK_HOME variable to our install location and appends the Spark binaries path to the system $PATH.
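The effect of the two exports can be verified in a throwaway shell session. This snippet simply repeats them and confirms that the Spark bin directory now appears on the PATH (the /opt/spark location matches the install step above):

```shell
# Reproduce the two exports from .bashrc and confirm the result
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# Look for /opt/spark/bin as a discrete PATH entry
case ":$PATH:" in
  *:/opt/spark/bin:*) echo "spark bin on PATH" ;;
  *)                  echo "spark bin missing" ;;
esac
```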
Starting a Standalone Spark Cluster
In standalone mode, Spark uses its own built-in cluster manager rather than an external one such as YARN or Mesos. The master and worker daemons can run on separate machines, but for this guide we will run everything on a single host.
We will start a standalone Spark cluster with one worker instance for demonstration.
First, start the Spark master server which acts as the cluster manager:
start-master.sh
Then in another terminal, start a Spark worker that connects to this master:
start-worker.sh spark://localhost:7077
We pass in the master's URL, which listens on port 7077 by default. The worker then registers itself with the master automatically.
We can check the status of our standalone cluster at the web UI: http://localhost:8080
Using the Spark Shell and PySpark
With the Spark cluster running, we can now make use of the Spark shell and PySpark to test out the setup.
Open another terminal and launch the Scala Spark shell:
spark-shell
This will open an interactive shell where we can run Spark code written in Scala. Let's test with a simple command:
sc.version
// res1: String = 3.3.1
A SparkContext is created automatically and bound to sc; here we use it to print the running Spark version.
Next, let's try PySpark, which lets us use Spark from Python:
pyspark
Similar to the Scala shell, a SparkContext is made available automatically as sc. Let's print the version:
sc.version
# '3.3.1'
So PySpark works correctly with our Python 3 installation.
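Beyond the interactive shell, the same kind of check can be scripted and launched with spark-submit, which is how real jobs are sent to the cluster. A minimal sketch follows; the file path and app name are arbitrary choices, and the final (commented) line assumes the master from earlier is still listening on port 7077:

```shell
# Write a tiny PySpark job to a temp file (summing 0..9 should yield 45)
cat > /tmp/smoke_test.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
print(spark.sparkContext.parallelize(range(10)).sum())  # 45
spark.stop()
EOF

# Submit it to the standalone master (uncomment once the cluster is up):
# spark-submit --master spark://localhost:7077 /tmp/smoke_test.py
```

If the job prints 45 and exits cleanly, the cluster is accepting and executing work end to end.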
Starting and Stopping the Cluster
We can start or stop the entire Spark cluster, including the master and all workers, using the start-all.sh and stop-all.sh scripts:
start-all.sh
stop-all.sh
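The start-all.sh script starts the master locally and then launches one worker on every host listed in $SPARK_HOME/conf/workers (this file was named "slaves" in Spark releases before 3.1). For our single-machine setup, the file only needs one line:

```
# /opt/spark/conf/workers — one worker host per line
localhost
```

Adding more hostnames here, with passwordless SSH configured to each, is how the same scripts scale out to a multi-node standalone cluster.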
To manage just the master itself:
start-master.sh
stop-master.sh
And for an individual worker instance:
start-worker.sh <master-url>
stop-worker.sh
This makes it very easy to start, stop and scale up or down our Spark cluster.
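Worker sizing can also be controlled declaratively through $SPARK_HOME/conf/spark-env.sh, which the daemon scripts source at startup. The variable names below are real Spark standalone settings; the values are just illustrative examples, not recommendations:

```shell
# /opt/spark/conf/spark-env.sh — optional standalone tuning (example values)
SPARK_WORKER_CORES=2        # CPU cores each worker may hand out to executors
SPARK_WORKER_MEMORY=2g      # total memory each worker may hand out to executors
SPARK_WORKER_INSTANCES=2    # number of worker processes to run per machine
```

After editing this file, restart the cluster with stop-all.sh and start-all.sh for the new limits to take effect.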
Conclusion
In this expert guide, I covered the full process for downloading, installing, and configuring Apache Spark 3.3.1 on Ubuntu 20.04 from the ground up. You learned how to:
- Install prerequisites like Java
- Download and set up the latest Spark release
- Configure environment variables
- Start a standalone Spark cluster
- Use the interactive Spark shell and PySpark
- Start, stop and manage the Spark master and workers
With Spark available on your Ubuntu server, you can now use this high-performance distributed framework for all your big data processing and analytics workloads. I hope you found this step-by-step expert guide useful. Let me know if you have any questions!


