# Slurm Mini-Cluster

A fully functional HPC cluster running on a local machine using Multipass VMs. This project demonstrates cluster orchestration, job scheduling with Slurm, and distributed computing concepts, including OpenMP and MPI workloads.
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Host Machine                           │
│                     (macOS / Apple Silicon)                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────┐                                           │
│   │    slurmctl     │  Controller Node                          │
│   │   ┌─────────┐   │  - slurmctld (scheduler daemon)           │
│   │   │slurmctld│   │  - munge (authentication)                 │
│   │   │  munge  │   │  - NFS server (/shared)                   │
│   │   │   NFS   │   │  - 2 vCPU, 2 GB RAM                       │
│   │   └─────────┘   │                                           │
│   └────────┬────────┘                                           │
│            │                                                    │
│            │  Slurm RPC (6817/6818)                             │
│            │  NFS mount                                         │
│            │                                                    │
│   ┌────────┴────────┬──────────────────┐                        │
│   │                 │                  │                        │
│   ▼                 ▼                  │                        │
│ ┌─────────────┐   ┌─────────────┐      │                        │
│ │     c1      │   │     c2      │      │                        │
│ │  ┌───────┐  │   │  ┌───────┐  │      │                        │
│ │  │slurmd │  │   │  │slurmd │  │  Compute Nodes                │
│ │  │ munge │  │   │  │ munge │  │  - slurmd (node daemon)       │
│ │  │  NFS  │  │   │  │  NFS  │  │  - munge (authentication)     │
│ │  └───────┘  │   │  └───────┘  │  - NFS client (/shared)       │
│ │ 2 vCPU/2 GB │   │ 2 vCPU/2 GB │  - OpenMP / MPI capable       │
│ └─────────────┘   └─────────────┘                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Components
| Node | Role | Services | Resources |
|---|---|---|---|
| slurmctl | Controller | slurmctld, munge, NFS server | 2 vCPU, 2 GB RAM, 20 GB disk |
| c1 | Compute | slurmd, munge, NFS client | 2 vCPU, 2 GB RAM, 20 GB disk |
| c2 | Compute | slurmd, munge, NFS client | 2 vCPU, 2 GB RAM, 20 GB disk |
## Table of Contents

- Prerequisites
- Phase 1: VM Provisioning
- Phase 2: Package Installation
- Phase 3: Munge Authentication
- Phase 4: NFS Shared Filesystem
- Phase 5: Slurm Configuration
- Phase 6: Validation Jobs
- Cluster Management
- Proof of Functionality
- Cleanup
## Prerequisites

- macOS with Apple Silicon or Intel processor
- At least 16 GB RAM
- 20 GB of free disk space (the three VM disks are thin-provisioned; fully used, they can grow to 60 GB combined)
- Homebrew installed
Install Multipass:

```
brew install --cask multipass
```

Verify installation:

```
multipass version
```

## Phase 1: VM Provisioning

Launch three Ubuntu 22.04 VMs:
```
multipass launch 22.04 -n slurmctl -c 2 -m 2G -d 20G
multipass launch 22.04 -n c1 -c 2 -m 2G -d 20G
multipass launch 22.04 -n c2 -c 2 -m 2G -d 20G
```

Verify all VMs are running:

```
multipass list
```

Expected output:

```
Name        State     IPv4            Image
slurmctl    Running   192.168.64.2    Ubuntu 22.04 LTS
c1          Running   192.168.64.3    Ubuntu 22.04 LTS
c2          Running   192.168.64.4    Ubuntu 22.04 LTS
```
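The next step captures each VM's IPv4 address by running an awk filter over `multipass info` output. A self-contained demonstration of just that filter, run against canned sample output (the address below is illustrative):

```shell
# Canned sample of `multipass info` output; the address is illustrative.
sample_info='Name:           slurmctl
State:          Running
IPv4:           192.168.64.2
Release:        Ubuntu 22.04 LTS'

# /IPv4/ matches the address line; print its second whitespace-separated
# field and exit at the first match so extra lines cannot add stray output.
ip=$(printf '%s\n' "$sample_info" | awk '/IPv4/{print $2; exit}')
echo "$ip"
```

Running this prints `192.168.64.2`, confirming the filter isolates the address field even when `multipass info` emits additional lines.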
Capture each VM's IP address into shell variables:

```
CTL_IP=$(multipass info slurmctl | awk '/IPv4/{print $2; exit}')
C1_IP=$(multipass info c1 | awk '/IPv4/{print $2; exit}')
C2_IP=$(multipass info c2 | awk '/IPv4/{print $2; exit}')
echo "slurmctl=$CTL_IP c1=$C1_IP c2=$C2_IP"
```

Set proper hostnames on each node:
```
multipass exec slurmctl -- sudo hostnamectl set-hostname slurmctl
multipass exec c1 -- sudo hostnamectl set-hostname c1
multipass exec c2 -- sudo hostnamectl set-hostname c2
```

Add hostname resolution to all nodes:
```
for n in slurmctl c1 c2; do
  multipass exec $n -- bash -lc "sudo tee -a /etc/hosts >/dev/null <<EOF
$CTL_IP slurmctl
$C1_IP c1
$C2_IP c2
EOF"
done
```

Verify connectivity:
```
multipass exec c1 -- ping -c 2 slurmctl
multipass exec c2 -- ping -c 2 slurmctl
```

## Phase 2: Package Installation

Install Slurm controller packages and build tools:
```
multipass exec slurmctl -- bash -lc "
  sudo apt-get update
  sudo apt-get install -y munge slurm-wlm slurmctld slurm-client build-essential
"
```

Install the Slurm compute daemon and build tools on the compute nodes:
```
for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo apt-get update
    sudo apt-get install -y munge slurm-wlm slurmd slurm-client build-essential
  "
done
```

## Phase 3: Munge Authentication

Munge provides authentication between Slurm components. All nodes must share the same key.
Generate a key on the controller and start munge:

```
multipass exec slurmctl -- bash -lc "
  sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024 2>/dev/null
  sudo chown munge:munge /etc/munge/munge.key
  sudo chmod 400 /etc/munge/munge.key
  sudo systemctl enable --now munge
"
```

Copy the key to the compute nodes. The key is readable only by root, so stage a readable copy for `multipass transfer` (which runs as the default user):

```
multipass exec slurmctl -- sudo install -m 644 /etc/munge/munge.key /tmp/munge.key
multipass transfer slurmctl:/tmp/munge.key /tmp/munge.key
multipass exec slurmctl -- sudo rm /tmp/munge.key
for n in c1 c2; do
  multipass transfer /tmp/munge.key $n:/tmp/munge.key
  multipass exec $n -- bash -lc "
    sudo mv /tmp/munge.key /etc/munge/munge.key
    sudo chown munge:munge /etc/munge/munge.key
    sudo chmod 400 /etc/munge/munge.key
    sudo systemctl enable --now munge
  "
done
rm /tmp/munge.key
```

Verify authentication from the controller and a compute node:

```
multipass exec slurmctl -- bash -lc "munge -n | unmunge"
multipass exec c1 -- bash -lc "munge -n | unmunge"
```

Both should show STATUS: Success.
## Phase 4: NFS Shared Filesystem

A shared filesystem makes job scripts and binaries accessible from all nodes.
Configure the NFS server on the controller:

```
multipass exec slurmctl -- bash -lc "
  sudo apt-get install -y nfs-kernel-server
  sudo mkdir -p /shared
  sudo chown ubuntu:ubuntu /shared
  echo '/shared *(rw,sync,no_subtree_check,no_root_squash)' | sudo tee /etc/exports
  sudo exportfs -ra
  sudo systemctl enable --now nfs-server
"
```

Mount the share on the compute nodes:

```
for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo apt-get install -y nfs-common
    sudo mkdir -p /shared
    echo 'slurmctl:/shared /shared nfs defaults 0 0' | sudo tee -a /etc/fstab
    sudo mount -a
  "
done
```

Verify the shared filesystem:

```
multipass exec slurmctl -- bash -lc "echo 'NFS test' > /shared/test.txt"
multipass exec c1 -- cat /shared/test.txt
multipass exec c2 -- cat /shared/test.txt
```

Both compute nodes should output `NFS test`.
## Phase 5: Slurm Configuration

Write the cluster configuration on the controller:

```
multipass exec slurmctl -- bash -lc "sudo tee /etc/slurm/slurm.conf >/dev/null <<'EOF'
# Cluster identification
ClusterName=minicluster
SlurmctldHost=slurmctl
SlurmUser=slurm
# Authentication
AuthType=auth/munge
# State preservation
StateSaveLocation=/var/lib/slurm/slurmctld
SlurmdSpoolDir=/var/lib/slurm/slurmd
# Ports
SlurmctldPort=6817
SlurmdPort=6818
# Process tracking
ProctrackType=proctrack/linuxproc
SwitchType=switch/none
MpiDefault=none
# Node recovery
ReturnToService=2
# Timeouts
SlurmctldTimeout=120
SlurmdTimeout=300
# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# Logging
SlurmctldDebug=info
SlurmdDebug=info
# Accounting (disabled for simplicity)
AccountingStorageType=accounting_storage/none
# Node definitions
NodeName=c1 CPUs=2 RealMemory=1900 State=UNKNOWN
NodeName=c2 CPUs=2 RealMemory=1900 State=UNKNOWN
# Partition definitions
PartitionName=debug Nodes=c1,c2 Default=YES MaxTime=00:20:00 State=UP
EOF"
```

Create the state directories referenced in slurm.conf:

```
multipass exec slurmctl -- bash -lc "
  sudo mkdir -p /var/lib/slurm/slurmctld /var/lib/slurm/slurmd
  sudo chown -R slurm:slurm /var/lib/slurm
"
```
```
for n in c1 c2; do
  multipass exec $n -- bash -lc "
    sudo mkdir -p /var/lib/slurm/slurmd
    sudo chown -R slurm:slurm /var/lib/slurm
  "
done
```

Distribute the configuration to the compute nodes:

```
multipass transfer slurmctl:/etc/slurm/slurm.conf /tmp/slurm.conf
for n in c1 c2; do
  multipass transfer /tmp/slurm.conf $n:/tmp/slurm.conf
  multipass exec $n -- sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf
done
rm /tmp/slurm.conf
```

Start the Slurm services. On the controller:
```
multipass exec slurmctl -- sudo systemctl enable --now slurmctld
```

On the compute nodes:
```
for n in c1 c2; do
  multipass exec $n -- sudo systemctl enable --now slurmd
done
```

Verify the cluster is up:

```
multipass exec slurmctl -- sinfo
```

Expected output:

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up      20:00      2   idle c1,c2
```

Check node details:

```
multipass exec slurmctl -- scontrol show nodes
```

## Phase 6: Validation Jobs

Install OpenMPI on all nodes:

```
for n in slurmctl c1 c2; do
  multipass exec $n -- bash -lc "sudo apt-get install -y openmpi-bin libopenmpi-dev"
done
```

Copy the job scripts and source files to the shared filesystem:
```
# Copy job scripts
multipass transfer jobs/hello.sbatch slurmctl:/shared/hello.sbatch
multipass transfer jobs/omp.sbatch slurmctl:/shared/omp.sbatch
multipass transfer jobs/mpi.sbatch slurmctl:/shared/mpi.sbatch

# Copy and compile source files
multipass transfer src/omp_pi.c slurmctl:/shared/omp_pi.c
multipass transfer src/mpi_hello.c slurmctl:/shared/mpi_hello.c
multipass exec slurmctl -- bash -lc "
  gcc -O3 -fopenmp /shared/omp_pi.c -o /shared/omp_pi
  mpicc /shared/mpi_hello.c -o /shared/mpi_hello
"
```

Simple batch job:
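The contents of jobs/hello.sbatch are not reproduced in this README. A minimal script consistent with the "simple hostname job" description and the /shared/hello.*.out output pattern used later might look like this (illustrative sketch; the actual file may differ):

```bash
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=/shared/hello.%j.out   # %j expands to the job ID
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Report which compute node the scheduler placed the job on
srun hostname
```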
```
multipass exec slurmctl -- sbatch /shared/hello.sbatch
multipass exec slurmctl -- squeue
```

OpenMP parallel job:
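As with the hello job, jobs/omp.sbatch is not shown here; a plausible version, assuming the 2-thread run described in the Results section, could be (illustrative sketch; the actual file may differ):

```bash
#!/bin/bash
#SBATCH --job-name=omp
#SBATCH --output=/shared/omp.%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2               # one task, two cores for threading

# Match the OpenMP thread count to the cores Slurm allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
/shared/omp_pi
```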
```
multipass exec slurmctl -- sbatch /shared/omp.sbatch
```

MPI distributed job:
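The MPI job script is likewise not reproduced; a sketch consistent with the 4-process, 2-node run shown in the Results section might be (illustrative; since slurm.conf sets MpiDefault=none, launching through Open MPI's mpirun is one workable choice):

```bash
#!/bin/bash
#SBATCH --job-name=mpi
#SBATCH --output=/shared/mpi.%j.out
#SBATCH --nodes=2                       # span both compute nodes
#SBATCH --ntasks=4                      # 2 ranks per node

# Open MPI detects the Slurm allocation, so no explicit hostfile is needed
mpirun -np $SLURM_NTASKS /shared/mpi_hello
```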
```
multipass exec slurmctl -- sbatch /shared/mpi.sbatch
```

Check job outputs:

```
multipass exec slurmctl -- bash -lc "ls -la /shared/*.out"
multipass exec slurmctl -- bash -lc "cat /shared/hello.*.out"
multipass exec slurmctl -- bash -lc "cat /shared/omp.*.out"
multipass exec slurmctl -- bash -lc "cat /shared/mpi.*.out"
```

## Cluster Management

| Command | Description |
|---|---|
| `sinfo` | View cluster and partition status |
| `sinfo -R` | Show reason for unavailable nodes |
| `squeue` | View job queue |
| `squeue -u $USER` | View your jobs |
| `scontrol show nodes` | Detailed node information |
| `scontrol show job <id>` | Detailed job information |
| `scancel <id>` | Cancel a job |
| `srun -N1 hostname` | Run interactive command on one node |
Check service status:

```
multipass exec slurmctl -- systemctl status slurmctld --no-pager
multipass exec c1 -- systemctl status slurmd --no-pager
```

Restart services:

```
multipass exec slurmctl -- sudo systemctl restart slurmctld
multipass exec c1 -- sudo systemctl restart slurmd
multipass exec c2 -- sudo systemctl restart slurmd
```

Run the included health check script:

```
multipass transfer scripts/healthcheck.sh slurmctl:/shared/healthcheck.sh
multipass exec slurmctl -- bash /shared/healthcheck.sh
```

## Proof of Functionality

The following screenshots demonstrate a fully operational cluster. All results are available in the Results directory.
- **VM List (`multipass list`)**: Shows all 3 VMs running: slurmctl (controller), c1 and c2 (compute nodes)
- **Cluster Status (`sinfo`)**: `sinfo` output showing the debug partition with 2 idle compute nodes
- **Node Details (`scontrol show nodes`)**: Detailed node information including CPU count, memory, and state
- **Hello and OpenMP Job Output**: Output from the simple batch job and the OpenMP parallel pi calculation (2 threads, computed pi=3.141592653590)
- **MPI Job Output**: MPI job running 4 processes across 2 nodes (ranks 0-1 on c1, ranks 2-3 on c2)
- **Munge Auth Test**: Munge credential encode/decode test showing successful cluster authentication
- **NFS Mount Verification**: NFS mount on a compute node showing the shared filesystem exported by the controller
## Cleanup

Stop the cluster:

```
multipass stop --all
```

Start it again later:

```
multipass start --all
```

Delete everything:

```
multipass delete slurmctl c1 c2
multipass purge
```

This removes all VMs and reclaims disk space.
See runbook.md for detailed troubleshooting procedures.
## Project Structure

```
slurm-mini-cluster/
├── README.md                       # This file
├── runbook.md                      # Troubleshooting guide
├── Diagrams/
│   └── architecture.md             # Architecture documentation
├── Results/                        # Screenshots proving cluster functionality
│   ├── Cluster Status (sinfo).png
│   ├── Node Details (scontrol show nodes).png
│   ├── VM List (multipass list).png
│   ├── Hello and OpenMP job Output.png
│   ├── MPI Job Output.png
│   ├── Munge Auth Test.png
│   └── NFS Mount Verification.png
├── jobs/
│   ├── hello.sbatch                # Simple hostname job
│   ├── omp.sbatch                  # OpenMP parallel job
│   └── mpi.sbatch                  # MPI distributed job
├── scripts/
│   ├── healthcheck.sh              # Cluster health monitoring
│   ├── collect-logs.sh             # Log collection utility
│   └── setup-cluster.sh            # Automated cluster setup
└── src/
    ├── omp_pi.c                    # OpenMP pi calculation
    └── mpi_hello.c                 # MPI hello world
```






