KrxGu/slurm-mini-cluster

Slurm Mini-Cluster

A fully functional HPC cluster running on a local machine using Multipass VMs. This project demonstrates cluster orchestration, job scheduling with Slurm, and distributed computing concepts including OpenMP and MPI workloads.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Host Machine                             │
│                     (macOS / Apple Silicon)                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────┐                                           │
│   │    slurmctl     │  Controller Node                          │
│   │   ┌─────────┐   │  - slurmctld (scheduler daemon)           │
│   │   │slurmctld│   │  - munge (authentication)                 │
│   │   │  munge  │   │  - NFS server (/shared)                   │
│   │   │   NFS   │   │  - 2 vCPU, 2 GB RAM                       │ 
│   │   └─────────┘   │                                           │
│   └────────┬────────┘                                           │
│            │                                                    │
│            │ Slurm RPC (6817/6818)                              │
│            │ NFS mount                                          │
│            │                                                    │
│   ┌────────┴────────┬──────────────────┐                        │
│   │                 │                  │                        │
│   ▼                 ▼                  │                        │
│ ┌─────────────┐  ┌─────────────┐       │                        │
│ │     c1      │  │     c2      │       │                        │
│ │  ┌───────┐  │  │  ┌───────┐  │       │                        │
│ │  │slurmd │  │  │  │slurmd │  │  Compute Nodes                 │
│ │  │ munge │  │  │  │ munge │  │  - slurmd (node daemon)        │
│ │  │  NFS  │  │  │  │  NFS  │  │  - munge (authentication)      │
│ │  └───────┘  │  │  └───────┘  │  - NFS client (/shared)        │
│ │ 2 vCPU/2 GB │  │ 2 vCPU/2 GB │  - OpenMP / MPI capable        │
│ └─────────────┘  └─────────────┘                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Components:

Node       Role         Services                       Resources
slurmctl   Controller   slurmctld, munge, NFS server   2 vCPU, 2 GB RAM, 20 GB disk
c1         Compute      slurmd, munge, NFS client      2 vCPU, 2 GB RAM, 20 GB disk
c2         Compute      slurmd, munge, NFS client      2 vCPU, 2 GB RAM, 20 GB disk

Table of Contents

  1. Prerequisites
  2. Phase 1: VM Provisioning
  3. Phase 2: Package Installation
  4. Phase 3: Munge Authentication
  5. Phase 4: NFS Shared Filesystem
  6. Phase 5: Slurm Configuration
  7. Phase 6: Validation Jobs
  8. Cluster Management
  9. Proof of Functionality
  10. Cleanup
  11. Troubleshooting
  12. Repository Structure

Prerequisites

  • macOS with Apple Silicon or Intel processor
  • At least 16 GB RAM
  • 20 GB free disk space
  • Homebrew installed

Install Multipass:

brew install --cask multipass

Verify installation:

multipass version

Phase 1: VM Provisioning

Create the Virtual Machines

Launch three Ubuntu 22.04 VMs:

multipass launch 22.04 -n slurmctl -c 2 -m 2G -d 20G
multipass launch 22.04 -n c1       -c 2 -m 2G -d 20G
multipass launch 22.04 -n c2       -c 2 -m 2G -d 20G

Verify all VMs are running:

multipass list

Expected output:

Name                    State             IPv4             Image
slurmctl                Running           192.168.64.2     Ubuntu 22.04 LTS
c1                      Running           192.168.64.3     Ubuntu 22.04 LTS
c2                      Running           192.168.64.4     Ubuntu 22.04 LTS

Capture IP Addresses

CTL_IP=$(multipass info slurmctl | awk '/IPv4/{print $2; exit}')
C1_IP=$(multipass info c1       | awk '/IPv4/{print $2; exit}')
C2_IP=$(multipass info c2       | awk '/IPv4/{print $2; exit}')
echo "slurmctl=$CTL_IP c1=$C1_IP c2=$C2_IP"
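The awk filter prints the second field of the first line matching IPv4, then exits so any additional addresses on later lines are ignored. Fed a sample of multipass info output (values here are invented), it behaves like this:

```shell
# Hypothetical sample of `multipass info` output piped through the same
# awk filter used above: print field 2 of the first IPv4 line, then stop.
printf 'Name:   slurmctl\nState:  Running\nIPv4:   192.168.64.2\n' |
  awk '/IPv4/{print $2; exit}'
# -> 192.168.64.2
```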

Configure Hostnames

Set proper hostnames on each node:

multipass exec slurmctl -- sudo hostnamectl set-hostname slurmctl
multipass exec c1       -- sudo hostnamectl set-hostname c1
multipass exec c2       -- sudo hostnamectl set-hostname c2

Configure /etc/hosts

Add hostname resolution to all nodes:

for n in slurmctl c1 c2; do
  multipass exec $n -- bash -lc "sudo tee -a /etc/hosts >/dev/null <<EOF
$CTL_IP slurmctl
$C1_IP c1
$C2_IP c2
EOF"
done

Verify connectivity:

multipass exec c1 -- ping -c 2 slurmctl
multipass exec c2 -- ping -c 2 slurmctl

Phase 2: Package Installation

Controller Node

Install Slurm controller packages and build tools:

multipass exec slurmctl -- bash -lc "
  sudo apt-get update
  sudo apt-get install -y munge slurm-wlm slurmctld slurm-client build-essential
"

Compute Nodes

Install Slurm compute daemon and build tools:

for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo apt-get update
    sudo apt-get install -y munge slurm-wlm slurmd slurm-client build-essential
  "
done

Phase 3: Munge Authentication

Munge provides authentication between Slurm components. All nodes must share the same key.

Generate Key on Controller

multipass exec slurmctl -- bash -lc "
  sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024 2>/dev/null
  sudo chown munge:munge /etc/munge/munge.key
  sudo chmod 400 /etc/munge/munge.key
  sudo systemctl enable --now munge
"

Distribute Key to Compute Nodes

# /etc/munge/munge.key is readable only by the munge user, so stage a
# readable copy before transferring, then remove it
multipass exec slurmctl -- sudo install -m 644 /etc/munge/munge.key /tmp/munge.key
multipass transfer slurmctl:/tmp/munge.key /tmp/munge.key
multipass exec slurmctl -- sudo rm /tmp/munge.key

for n in c1 c2; do
  multipass transfer /tmp/munge.key $n:/tmp/munge.key
  multipass exec $n -- bash -lc "
    sudo mv /tmp/munge.key /etc/munge/munge.key
    sudo chown munge:munge /etc/munge/munge.key
    sudo chmod 400 /etc/munge/munge.key
    sudo systemctl enable --now munge
  "
done

rm /tmp/munge.key

Verify Munge Authentication

multipass exec slurmctl -- bash -lc "munge -n | unmunge"
multipass exec c1       -- bash -lc "munge -n | unmunge"

Both should show STATUS: Success.


Phase 4: NFS Shared Filesystem

A shared filesystem allows job scripts and binaries to be accessible from all nodes.

Configure NFS Server (Controller)

multipass exec slurmctl -- bash -lc "
  sudo apt-get install -y nfs-kernel-server
  sudo mkdir -p /shared
  sudo chown ubuntu:ubuntu /shared
  echo '/shared *(rw,sync,no_subtree_check,no_root_squash)' | sudo tee /etc/exports
  sudo exportfs -ra
  sudo systemctl enable --now nfs-server
"

Mount NFS on Compute Nodes

for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo apt-get install -y nfs-common
    sudo mkdir -p /shared
    echo 'slurmctl:/shared /shared nfs defaults 0 0' | sudo tee -a /etc/fstab
    sudo mount -a
  "
done

Verify NFS Mount

multipass exec slurmctl -- bash -lc "echo 'NFS test' > /shared/test.txt"
multipass exec c1 -- cat /shared/test.txt
multipass exec c2 -- cat /shared/test.txt

Both compute nodes should output NFS test.


Phase 5: Slurm Configuration

Create slurm.conf on Controller

multipass exec slurmctl -- bash -lc "sudo tee /etc/slurm/slurm.conf >/dev/null <<'EOF'
# Cluster identification
ClusterName=minicluster
SlurmctldHost=slurmctl
SlurmUser=slurm

# Authentication
AuthType=auth/munge

# State preservation
StateSaveLocation=/var/lib/slurm/slurmctld
SlurmdSpoolDir=/var/lib/slurm/slurmd

# Ports
SlurmctldPort=6817
SlurmdPort=6818

# Process tracking
ProctrackType=proctrack/linuxproc
SwitchType=switch/none
MpiDefault=none

# Node recovery
ReturnToService=2

# Timeouts
SlurmctldTimeout=120
SlurmdTimeout=300

# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# Logging
SlurmctldDebug=info
SlurmdDebug=info

# Accounting (disabled for simplicity)
AccountingStorageType=accounting_storage/none

# Node definitions
NodeName=c1 CPUs=2 RealMemory=1900 State=UNKNOWN
NodeName=c2 CPUs=2 RealMemory=1900 State=UNKNOWN

# Partition definitions
PartitionName=debug Nodes=c1,c2 Default=YES MaxTime=00:20:00 State=UP
EOF"
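RealMemory=1900 is deliberately a little below the 2 GB each VM was launched with, since the OS reserves some memory for itself. On a node, slurmd -C prints the values slurmd would report for these fields; as a rough cross-check you can also read total memory in MiB directly (RealMemory must not exceed what the node reports):

```shell
# Total usable memory in MiB as the kernel reports it; slurm.conf's
# RealMemory for this node should be at or below this number.
free -m | awk '/^Mem:/{print $2}'
```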

Create State Directories

multipass exec slurmctl -- bash -lc "
  sudo mkdir -p /var/lib/slurm/slurmctld /var/lib/slurm/slurmd
  sudo chown -R slurm:slurm /var/lib/slurm
"

for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo mkdir -p /var/lib/slurm/slurmd
    sudo chown -R slurm:slurm /var/lib/slurm
  "
done

Distribute Configuration

multipass transfer slurmctl:/etc/slurm/slurm.conf /tmp/slurm.conf

for n in c1 c2; do
  multipass transfer /tmp/slurm.conf $n:/tmp/slurm.conf
  multipass exec $n -- sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf
done

rm /tmp/slurm.conf

Start Slurm Services

Controller:

multipass exec slurmctl -- sudo systemctl enable --now slurmctld

Compute nodes:

for n in c1 c2; do
  multipass exec $n -- sudo systemctl enable --now slurmd
done

Verify Cluster Status

multipass exec slurmctl -- sinfo

Expected output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up      20:00      2   idle c1,c2

Check node details:

multipass exec slurmctl -- scontrol show nodes

Phase 6: Validation Jobs

Install OpenMPI (for MPI jobs)

for n in slurmctl c1 c2; do
  multipass exec $n -- bash -lc "sudo apt-get install -y openmpi-bin libopenmpi-dev"
done

Deploy Job Files

Copy the job scripts and source files to the shared filesystem:

# Copy job scripts
multipass transfer jobs/hello.sbatch slurmctl:/shared/hello.sbatch
multipass transfer jobs/omp.sbatch slurmctl:/shared/omp.sbatch
multipass transfer jobs/mpi.sbatch slurmctl:/shared/mpi.sbatch
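The sbatch scripts themselves live in jobs/ and are not reproduced in this README. As a sketch of their general shape (assumed contents, not copied from the repo), hello.sbatch could be as small as:

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=/shared/hello.%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Report which compute node the scheduler placed the job on
echo "Hello from $(hostname)"
```

omp.sbatch and mpi.sbatch would follow the same pattern: the OpenMP script adding --cpus-per-task=2, exporting OMP_NUM_THREADS to match, and running /shared/omp_pi; the MPI script requesting --ntasks=4 across both nodes and launching /shared/mpi_hello.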

# Copy and compile source files
multipass transfer src/omp_pi.c slurmctl:/shared/omp_pi.c
multipass transfer src/mpi_hello.c slurmctl:/shared/mpi_hello.c

multipass exec slurmctl -- bash -lc "
  gcc -O3 -fopenmp /shared/omp_pi.c -o /shared/omp_pi
  mpicc /shared/mpi_hello.c -o /shared/mpi_hello
"

Run Validation Jobs

Simple batch job:

multipass exec slurmctl -- sbatch /shared/hello.sbatch
multipass exec slurmctl -- squeue

OpenMP parallel job:

multipass exec slurmctl -- sbatch /shared/omp.sbatch

MPI distributed job:

multipass exec slurmctl -- sbatch /shared/mpi.sbatch

Check job outputs:

multipass exec slurmctl -- bash -lc "ls -la /shared/*.out"
multipass exec slurmctl -- bash -lc "cat /shared/hello.*.out"
multipass exec slurmctl -- bash -lc "cat /shared/omp.*.out"
multipass exec slurmctl -- bash -lc "cat /shared/mpi.*.out"

Cluster Management

Common Slurm Commands

Command                    Description
sinfo                      View cluster and partition status
sinfo -R                   Show reason for unavailable nodes
squeue                     View job queue
squeue -u $USER            View your jobs
scontrol show nodes        Detailed node information
scontrol show job <id>     Detailed job information
scancel <id>               Cancel a job
srun -N1 hostname          Run an interactive command on one node

Service Management

Check service status:

multipass exec slurmctl -- systemctl status slurmctld --no-pager
multipass exec c1 -- systemctl status slurmd --no-pager

Restart services:

multipass exec slurmctl -- sudo systemctl restart slurmctld
multipass exec c1 -- sudo systemctl restart slurmd
multipass exec c2 -- sudo systemctl restart slurmd

Health Check

Run the included health check script:

multipass transfer scripts/healthcheck.sh slurmctl:/shared/healthcheck.sh
multipass exec slurmctl -- bash /shared/healthcheck.sh
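The repo's scripts/healthcheck.sh is not reproduced here; a minimal script in the same spirit (an assumed sketch, not the actual file) just probes each moving part and prints PASS or FAIL per check instead of aborting on the first failure:

```shell
#!/usr/bin/env bash
# Probe each cluster component; report PASS/FAIL per check.

check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
  fi
}

check "munge encode/decode"  bash -c 'munge -n | unmunge'
check "controller reachable" scontrol ping
check "partition visible"    sinfo
check "shared filesystem"    test -d /shared
```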

Proof of Functionality

The following screenshots demonstrate a fully operational cluster. All results are available in the Results directory.

VM Infrastructure

VM List

Shows all 3 VMs running: slurmctl (controller), c1 and c2 (compute nodes)

Cluster Status


sinfo output showing the debug partition with 2 idle compute nodes

Node Details


Detailed node information including CPU count, memory, and state

Job Outputs

Hello and OpenMP Jobs

Output from the simple batch job and OpenMP parallel pi calculation (2 threads, computed pi=3.141592653590)

MPI Job

MPI job running 4 processes across 2 nodes (ranks 0-1 on c1, ranks 2-3 on c2)

Authentication and Filesystem

Munge Authentication

Munge credential encode/decode test showing successful cluster authentication

NFS Mount

NFS mount verification on compute node showing shared filesystem from controller


Cleanup

Stop VMs (preserves state)

multipass stop --all

Restart VMs

multipass start --all

Complete Removal

multipass delete slurmctl c1 c2
multipass purge

This removes all VMs and reclaims disk space.


Troubleshooting

See runbook.md for detailed troubleshooting procedures.


Repository Structure

slurm-mini-cluster/
├── README.md                 # This file
├── runbook.md                # Troubleshooting guide
├── Diagrams/
│   └── architecture.md       # Architecture documentation
├── Results/                  # Screenshots proving cluster functionality
│   ├── Cluster Status (sinfo).png
│   ├── Node Details (scontrol show nodes).png
│   ├── VM List (multipass list).png
│   ├── Hello and OpenMP job Output.png
│   ├── MPI Job Output.png
│   ├── Munge Auth Test.png
│   └── NFS Mount Verification.png
├── jobs/
│   ├── hello.sbatch          # Simple hostname job
│   ├── omp.sbatch            # OpenMP parallel job
│   └── mpi.sbatch            # MPI distributed job
├── scripts/
│   ├── healthcheck.sh        # Cluster health monitoring
│   ├── collect-logs.sh       # Log collection utility
│   └── setup-cluster.sh      # Automated cluster setup
└── src/
    ├── omp_pi.c              # OpenMP pi calculation
    └── mpi_hello.c           # MPI hello world
