Basic Hadoop and Spark config files

This is a simple configuration for a MapReduce framework on a Linux server, including:

Apache Hadoop
Apache Spark standalone

If you want to quickly deploy a Spark cluster on a Slurm server as a regular user, look at https://github.com/feng-li/spark-on-slurm.

Prerequisites

Make sure the necessary environment variables are set. If you have access to /etc/environment, you can write them there and they will take effect globally. Otherwise, write them to a file (usually ~/.bashrc or ~/.zshrc) and source it before you start the servers.

## NOTE: /etc/environment does not support `$` expansion.

JAVA_HOME=/usr/lib/jvm/default-java/
HADOOP_HOME=/soft/APP/hadoop
SPARK_HOME=/soft/APP/spark
HADOOP_CONF_DIR=/soft/APP/hadoop-spark-conf/hadoop/etc/hadoop
SPARK_CONF_DIR=/soft/APP/hadoop-spark-conf/spark/conf

PATH=/soft/APP/hadoop/bin:/soft/APP/spark/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
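
When the variables go into ~/.bashrc or ~/.zshrc instead of /etc/environment, `$` expansion is available, so the paths need not be repeated. The following is a minimal sketch, assuming the same install locations as above; the `export` keyword is needed there so that the servers started from the shell inherit the values.

```shell
# ~/.bashrc or ~/.zshrc (assumed install paths, same as the listing above)
export JAVA_HOME=/usr/lib/jvm/default-java/
export HADOOP_HOME=/soft/APP/hadoop
export SPARK_HOME=/soft/APP/spark
export HADOOP_CONF_DIR=/soft/APP/hadoop-spark-conf/hadoop/etc/hadoop
export SPARK_CONF_DIR=/soft/APP/hadoop-spark-conf/spark/conf

# Prepend the Hadoop and Spark binaries to whatever PATH is already set
export PATH="$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH"
```

After editing the file, run `source ~/.bashrc` (or open a new shell) before starting the servers.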

Tuning Spark

Use the Kryo library to serialize objects. Replacing the default Java serialization with the faster Kryo serializer in spark-defaults.conf can often speed up serialization by as much as 10x.
```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
```
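
One way to apply the setting without editing the file by hand is sketched below. It assumes `SPARK_CONF_DIR` is set as in the Prerequisites (falling back to a hypothetical `./conf` directory otherwise) and appends the line only if no `spark.serializer` entry exists yet, so it is safe to run more than once.

```shell
# Location of spark-defaults.conf; ./conf is only an illustrative fallback
conf_file="${SPARK_CONF_DIR:-./conf}/spark-defaults.conf"
mkdir -p "$(dirname "$conf_file")"
touch "$conf_file"

# Append the Kryo serializer only when no spark.serializer line is present
if ! grep -q '^spark\.serializer[[:space:]]' "$conf_file"; then
    echo 'spark.serializer                 org.apache.spark.serializer.KryoSerializer' >> "$conf_file"
fi
```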
Eliminate BLAS threads. Spark running on YARN or in standalone mode already parallelizes work across cores, so BLAS libraries should not add thread-level parallelism of their own. Set the following environment variables in spark-env.sh, or set them at run time.
```
MKL_NUM_THREADS=1
OPENBLAS_NUM_THREADS=1
```
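
For the run-time route, Spark's `spark.executorEnv.[Name]` properties pass environment variables to the executors. The sketch below builds the matching `spark-submit` flags; the helper name `blas_single_thread_flags` and the script name `my_job.py` are made up for illustration.

```shell
# Print spark-submit flags that pin each BLAS library to one thread on executors
blas_single_thread_flags() {
    for var in MKL_NUM_THREADS OPENBLAS_NUM_THREADS; do
        printf -- '--conf spark.executorEnv.%s=1 ' "$var"
    done
}

# Example usage (my_job.py is a placeholder for your application):
#   spark-submit $(blas_single_thread_flags) my_job.py
```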
Oversubscribe resources. On a small cluster shared by many users, the cluster is not fully loaded all the time. An oversubscription trick can improve the cluster's efficiency by allowing more jobs to run simultaneously. It is often safe to declare double the number of physical cores and/or the total memory. Assuming each worker node has 32 physical cores and 64 GB of RAM, we could double both in spark-env.sh.
```
SPARK_WORKER_CORES=64
SPARK_WORKER_MEMORY=128g
```
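
The arithmetic can be wrapped in a small helper for nodes of other sizes. This is only a sketch: the function name `oversubscribe` and its arguments (physical cores, RAM in GB, and an optional factor defaulting to 2) are illustrative and not part of Spark.

```shell
# Print spark-env.sh lines for a worker, scaled by an oversubscription factor
oversubscribe() {
    cores=$1          # physical cores on the worker node
    mem_gb=$2         # physical RAM in GB
    factor=${3:-2}    # oversubscription factor; 2 doubles both values
    echo "SPARK_WORKER_CORES=$((cores * factor))"
    echo "SPARK_WORKER_MEMORY=$((mem_gb * factor))g"
}

# The 32-core, 64 GB node from the text
oversubscribe 32 64
```

With the default factor this reproduces the two lines shown above; a smaller factor may be a safer starting point if jobs on the cluster tend to be memory-bound.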
