KrxGu/slurm-mini-cluster

Slurm Mini-Cluster

A fully functional HPC cluster running on a local machine using Multipass VMs. This project demonstrates cluster orchestration, job scheduling with Slurm, and distributed computing concepts including OpenMP and MPI workloads.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Host Machine                             │
│                     (macOS / Apple Silicon)                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────┐                                           │
│   │    slurmctl     │  Controller Node                          │
│   │   ┌─────────┐   │  - slurmctld (scheduler daemon)           │
│   │   │slurmctld│   │  - munge (authentication)                 │
│   │   │  munge  │   │  - NFS server (/shared)                   │
│   │   │   NFS   │   │  - 2 vCPU, 2 GB RAM                       │ 
│   │   └─────────┘   │                                           │
│   └────────┬────────┘                                           │
│            │                                                    │
│            │ Slurm RPC (6817/6818)                              │
│            │ NFS mount                                          │
│            │                                                    │
│   ┌────────┴────────┬──────────────────┐                        │
│   │                 │                  │                        │
│   ▼                 ▼                  │                        │
│ ┌─────────────┐  ┌─────────────┐       │                        │
│ │     c1      │  │     c2      │       │                        │
│ │  ┌───────┐  │  │  ┌───────┐  │       │                        │
│ │  │slurmd │  │  │  │slurmd │  │  Compute Nodes                 │
│ │  │ munge │  │  │  │ munge │  │  - slurmd (node daemon)        │
│ │  │  NFS  │  │  │  │  NFS  │  │  - munge (authentication)      │
│ │  └───────┘  │  │  └───────┘  │  - NFS client (/shared)        │
│ │ 2 vCPU/2 GB │  │ 2 vCPU/2 GB │  - OpenMP / MPI capable        │
│ └─────────────┘  └─────────────┘                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Components:

Node       Role         Services                       Resources
slurmctl   Controller   slurmctld, munge, NFS server   2 vCPU, 2 GB RAM, 20 GB disk
c1         Compute      slurmd, munge, NFS client      2 vCPU, 2 GB RAM, 20 GB disk
c2         Compute      slurmd, munge, NFS client      2 vCPU, 2 GB RAM, 20 GB disk

Table of Contents

  1. Prerequisites
  2. Phase 1: VM Provisioning
  3. Phase 2: Package Installation
  4. Phase 3: Munge Authentication
  5. Phase 4: NFS Shared Filesystem
  6. Phase 5: Slurm Configuration
  7. Phase 6: Validation Jobs
  8. Cluster Management
  9. Proof of Functionality
  10. Cleanup
  11. Troubleshooting
  12. Repository Structure

Prerequisites

  • macOS with Apple Silicon or Intel processor
  • At least 16 GB RAM
  • 20 GB free disk space
  • Homebrew installed

Install Multipass:

brew install --cask multipass

Verify installation:

multipass version

Phase 1: VM Provisioning

Create the Virtual Machines

Launch three Ubuntu 22.04 VMs:

multipass launch 22.04 -n slurmctl -c 2 -m 2G -d 20G
multipass launch 22.04 -n c1       -c 2 -m 2G -d 20G
multipass launch 22.04 -n c2       -c 2 -m 2G -d 20G

Verify all VMs are running:

multipass list

Expected output:

Name                    State             IPv4             Image
slurmctl                Running           192.168.64.2     Ubuntu 22.04 LTS
c1                      Running           192.168.64.3     Ubuntu 22.04 LTS
c2                      Running           192.168.64.4     Ubuntu 22.04 LTS

Capture IP Addresses

CTL_IP=$(multipass info slurmctl | awk '/IPv4/{print $2; exit}')
C1_IP=$(multipass info c1       | awk '/IPv4/{print $2; exit}')
C2_IP=$(multipass info c2       | awk '/IPv4/{print $2; exit}')
echo "slurmctl=$CTL_IP c1=$C1_IP c2=$C2_IP"
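The awk filter prints the second field of the first line matching IPv4, then exits so any additional addresses on later lines are ignored. Fed a sample of multipass info output (values here are invented), it behaves like this:

```shell
# Hypothetical sample of `multipass info` output piped through the same
# awk filter used above: print field 2 of the first IPv4 line, then stop.
printf 'Name:   slurmctl\nState:  Running\nIPv4:   192.168.64.2\n' |
  awk '/IPv4/{print $2; exit}'
# -> 192.168.64.2
```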

Configure Hostnames

Set proper hostnames on each node:

multipass exec slurmctl -- sudo hostnamectl set-hostname slurmctl
multipass exec c1       -- sudo hostnamectl set-hostname c1
multipass exec c2       -- sudo hostnamectl set-hostname c2

Configure /etc/hosts

Add hostname resolution to all nodes:

for n in slurmctl c1 c2; do
  multipass exec $n -- bash -lc "sudo tee -a /etc/hosts >/dev/null <<EOF
$CTL_IP slurmctl
$C1_IP c1
$C2_IP c2
EOF"
done

Verify connectivity:

multipass exec c1 -- ping -c 2 slurmctl
multipass exec c2 -- ping -c 2 slurmctl

Phase 2: Package Installation

Controller Node

Install Slurm controller packages and build tools:

multipass exec slurmctl -- bash -lc "
  sudo apt-get update
  sudo apt-get install -y munge slurm-wlm slurmctld slurm-client build-essential
"

Compute Nodes

Install Slurm compute daemon and build tools:

for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo apt-get update
    sudo apt-get install -y munge slurm-wlm slurmd slurm-client build-essential
  "
done

Phase 3: Munge Authentication

Munge provides authentication between Slurm components. All nodes must share the same key.

Generate Key on Controller

multipass exec slurmctl -- bash -lc "
  sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024 2>/dev/null
  sudo chown munge:munge /etc/munge/munge.key
  sudo chmod 400 /etc/munge/munge.key
  sudo systemctl enable --now munge
"

Distribute Key to Compute Nodes

# /etc/munge/munge.key is readable only by the munge user, so stage a
# readable copy before transferring, then remove it
multipass exec slurmctl -- sudo install -m 644 /etc/munge/munge.key /tmp/munge.key
multipass transfer slurmctl:/tmp/munge.key /tmp/munge.key
multipass exec slurmctl -- sudo rm /tmp/munge.key

for n in c1 c2; do
  multipass transfer /tmp/munge.key $n:/tmp/munge.key
  multipass exec $n -- bash -lc "
    sudo mv /tmp/munge.key /etc/munge/munge.key
    sudo chown munge:munge /etc/munge/munge.key
    sudo chmod 400 /etc/munge/munge.key
    sudo systemctl enable --now munge
  "
done

rm /tmp/munge.key

Verify Munge Authentication

multipass exec slurmctl -- bash -lc "munge -n | unmunge"
multipass exec c1       -- bash -lc "munge -n | unmunge"

Both should show STATUS: Success.


Phase 4: NFS Shared Filesystem

A shared filesystem allows job scripts and binaries to be accessible from all nodes.

Configure NFS Server (Controller)

multipass exec slurmctl -- bash -lc "
  sudo apt-get install -y nfs-kernel-server
  sudo mkdir -p /shared
  sudo chown ubuntu:ubuntu /shared
  echo '/shared *(rw,sync,no_subtree_check,no_root_squash)' | sudo tee /etc/exports
  sudo exportfs -ra
  sudo systemctl enable --now nfs-server
"

Mount NFS on Compute Nodes

for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo apt-get install -y nfs-common
    sudo mkdir -p /shared
    echo 'slurmctl:/shared /shared nfs defaults 0 0' | sudo tee -a /etc/fstab
    sudo mount -a
  "
done

Verify NFS Mount

multipass exec slurmctl -- bash -lc "echo 'NFS test' > /shared/test.txt"
multipass exec c1 -- cat /shared/test.txt
multipass exec c2 -- cat /shared/test.txt

Both compute nodes should output NFS test.


Phase 5: Slurm Configuration

Create slurm.conf on Controller

multipass exec slurmctl -- bash -lc "sudo tee /etc/slurm/slurm.conf >/dev/null <<'EOF'
# Cluster identification
ClusterName=minicluster
SlurmctldHost=slurmctl
SlurmUser=slurm

# Authentication
AuthType=auth/munge

# State preservation
StateSaveLocation=/var/lib/slurm/slurmctld
SlurmdSpoolDir=/var/lib/slurm/slurmd

# Ports
SlurmctldPort=6817
SlurmdPort=6818

# Process tracking
ProctrackType=proctrack/linuxproc
SwitchType=switch/none
MpiDefault=none

# Node recovery
ReturnToService=2

# Timeouts
SlurmctldTimeout=120
SlurmdTimeout=300

# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# Logging
SlurmctldDebug=info
SlurmdDebug=info

# Accounting (disabled for simplicity)
AccountingStorageType=accounting_storage/none

# Node definitions
NodeName=c1 CPUs=2 RealMemory=1900 State=UNKNOWN
NodeName=c2 CPUs=2 RealMemory=1900 State=UNKNOWN

# Partition definitions
PartitionName=debug Nodes=c1,c2 Default=YES MaxTime=00:20:00 State=UP
EOF"
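RealMemory=1900 is deliberately a little below the 2 GB each VM was launched with, since the OS reserves some memory for itself. On a node, slurmd -C prints the values slurmd would report for these fields; as a rough cross-check you can also read total memory in MiB directly (RealMemory must not exceed what the node reports):

```shell
# Total usable memory in MiB as the kernel reports it; slurm.conf's
# RealMemory for this node should be at or below this number.
free -m | awk '/^Mem:/{print $2}'
```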

Create State Directories

multipass exec slurmctl -- bash -lc "
  sudo mkdir -p /var/lib/slurm/slurmctld /var/lib/slurm/slurmd
  sudo chown -R slurm:slurm /var/lib/slurm
"

for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo mkdir -p /var/lib/slurm/slurmd
    sudo chown -R slurm:slurm /var/lib/slurm
  "
done

Distribute Configuration

multipass transfer slurmctl:/etc/slurm/slurm.conf /tmp/slurm.conf

for n in c1 c2; do
  multipass transfer /tmp/slurm.conf $n:/tmp/slurm.conf
  multipass exec $n -- sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf
done

rm /tmp/slurm.conf

Start Slurm Services

Controller:

multipass exec slurmctl -- sudo systemctl enable --now slurmctld

Compute nodes:

for n in c1 c2; do
  multipass exec $n -- sudo systemctl enable --now slurmd
done

Verify Cluster Status

multipass exec slurmctl -- sinfo

Expected output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up      20:00      2   idle c1,c2

Check node details:

multipass exec slurmctl -- scontrol show nodes

Phase 6: Validation Jobs

Install OpenMPI (for MPI jobs)

for n in slurmctl c1 c2; do
  multipass exec $n -- bash -lc "sudo apt-get install -y openmpi-bin libopenmpi-dev"
done

Deploy Job Files

Copy the job scripts and source files to the shared filesystem:

# Copy job scripts
multipass transfer jobs/hello.sbatch slurmctl:/shared/hello.sbatch
multipass transfer jobs/omp.sbatch slurmctl:/shared/omp.sbatch
multipass transfer jobs/mpi.sbatch slurmctl:/shared/mpi.sbatch
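The sbatch scripts themselves live in jobs/ and are not reproduced in this README. As a sketch of their general shape (assumed contents, not copied from the repo), hello.sbatch could be as small as:

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=/shared/hello.%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Report which compute node the scheduler placed the job on
echo "Hello from $(hostname)"
```

omp.sbatch and mpi.sbatch would follow the same pattern: the OpenMP script adding --cpus-per-task=2, exporting OMP_NUM_THREADS to match, and running /shared/omp_pi; the MPI script requesting --ntasks=4 across both nodes and launching /shared/mpi_hello.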

# Copy and compile source files
multipass transfer src/omp_pi.c slurmctl:/shared/omp_pi.c
multipass transfer src/mpi_hello.c slurmctl:/shared/mpi_hello.c

multipass exec slurmctl -- bash -lc "
  gcc -O3 -fopenmp /shared/omp_pi.c -o /shared/omp_pi
  mpicc /shared/mpi_hello.c -o /shared/mpi_hello
"

Run Validation Jobs

Simple batch job:

multipass exec slurmctl -- sbatch /shared/hello.sbatch
multipass exec slurmctl -- squeue

OpenMP parallel job:

multipass exec slurmctl -- sbatch /shared/omp.sbatch

MPI distributed job:

multipass exec slurmctl -- sbatch /shared/mpi.sbatch

Check job outputs:

multipass exec slurmctl -- bash -lc "ls -la /shared/*.out"
multipass exec slurmctl -- bash -lc "cat /shared/hello.*.out"
multipass exec slurmctl -- bash -lc "cat /shared/omp.*.out"
multipass exec slurmctl -- bash -lc "cat /shared/mpi.*.out"

Cluster Management

Common Slurm Commands

Command                    Description
sinfo                      View cluster and partition status
sinfo -R                   Show reason for unavailable nodes
squeue                     View job queue
squeue -u $USER            View your jobs
scontrol show nodes        Detailed node information
scontrol show job <id>     Detailed job information
scancel <id>               Cancel a job
srun -N1 hostname          Run an interactive command on one node

Service Management

Check service status:

multipass exec slurmctl -- systemctl status slurmctld --no-pager
multipass exec c1 -- systemctl status slurmd --no-pager

Restart services:

multipass exec slurmctl -- sudo systemctl restart slurmctld
multipass exec c1 -- sudo systemctl restart slurmd
multipass exec c2 -- sudo systemctl restart slurmd

Health Check

Run the included health check script:

multipass transfer scripts/healthcheck.sh slurmctl:/shared/healthcheck.sh
multipass exec slurmctl -- bash /shared/healthcheck.sh
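The repo's scripts/healthcheck.sh is not reproduced here; a minimal script in the same spirit (an assumed sketch, not the actual file) just probes each moving part and prints PASS or FAIL per check instead of aborting on the first failure:

```shell
#!/usr/bin/env bash
# Probe each cluster component; report PASS/FAIL per check.

check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
  fi
}

check "munge encode/decode"  bash -c 'munge -n | unmunge'
check "controller reachable" scontrol ping
check "partition visible"    sinfo
check "shared filesystem"    test -d /shared
```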

Proof of Functionality

The following screenshots demonstrate a fully operational cluster. All results are available in the Results directory.

VM Infrastructure

VM List

Shows all 3 VMs running: slurmctl (controller), c1 and c2 (compute nodes)

Cluster Status


sinfo output showing the debug partition with 2 idle compute nodes

Node Details


Detailed node information including CPU count, memory, and state

Job Outputs

Hello and OpenMP Jobs

Output from the simple batch job and OpenMP parallel pi calculation (2 threads, computed pi=3.141592653590)

MPI Job

MPI job running 4 processes across 2 nodes (ranks 0-1 on c1, ranks 2-3 on c2)

Authentication and Filesystem

Munge Authentication

Munge credential encode/decode test showing successful cluster authentication

NFS Mount

NFS mount verification on compute node showing shared filesystem from controller


Cleanup

Stop VMs (preserves state)

multipass stop --all

Restart VMs

multipass start --all

Complete Removal

multipass delete slurmctl c1 c2
multipass purge

This removes all VMs and reclaims disk space.


Troubleshooting

See runbook.md for detailed troubleshooting procedures.


Repository Structure

slurm-mini-cluster/
├── README.md                 # This file
├── runbook.md                # Troubleshooting guide
├── Diagrams/
│   └── architecture.md       # Architecture documentation
├── Results/                  # Screenshots proving cluster functionality
│   ├── Cluster Status (sinfo).png
│   ├── Node Details (scontrol show nodes).png
│   ├── VM List (multipass list).png
│   ├── Hello and OpenMP job Output.png
│   ├── MPI Job Output.png
│   ├── Munge Auth Test.png
│   └── NFS Mount Verification.png
├── jobs/
│   ├── hello.sbatch          # Simple hostname job
│   ├── omp.sbatch            # OpenMP parallel job
│   └── mpi.sbatch            # MPI distributed job
├── scripts/
│   ├── healthcheck.sh        # Cluster health monitoring
│   ├── collect-logs.sh       # Log collection utility
│   └── setup-cluster.sh      # Automated cluster setup
└── src/
    ├── omp_pi.c              # OpenMP pi calculation
    └── mpi_hello.c           # MPI hello world
