How To Set Up SLURM on Linux
What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is a workload manager. It handles job scheduling, resource allocation, and node management in clusters.
1. Controller (Master) Node
1.1 Install SLURM
sudo pacman -S slurm-llnl
1.2 Create SLURM User
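The slurm system user owns the files and directories that SLURM uses. It needs the same UID and GID on the controller and on every compute node, so that file ownership is consistent across the cluster.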
sudo useradd -r -s /usr/bin/nologin slurm
1.3 Configure slurm.conf
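This file must be identical on all nodes. Adjust SlurmctldHost, the node names and addresses, and the hardware values (CPUs, RealMemory, Sockets, CoresPerSocket, ThreadsPerCore) to match your own cluster.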
ClusterName=cluster
SlurmctldHost=archlinux
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPort=6817
SlurmdPort=6818
SlurmUser=slurm
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=node1 NodeAddr=192.168.144.132 CPUs=2 RealMemory=3500 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=node2 NodeAddr=192.168.144.175 CPUs=2 RealMemory=3500 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
Save this file as /etc/slurm-llnl/slurm.conf on the controller.
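A mismatch between CPUs and the socket/core/thread values on a NodeName line is an easy mistake to make when editing this file by hand. The sketch below flags such lines; check_node_cpus is a hypothetical helper name, not part of SLURM, and it assumes the common case where CPUs equals Sockets * CoresPerSocket * ThreadsPerCore. SLURM accepts other combinations in some setups, so treat a report as a prompt to double-check rather than a hard error.

```shell
# Sketch: flag NodeName lines whose CPUs value differs from
# Sockets * CoresPerSocket * ThreadsPerCore.
# check_node_cpus is a hypothetical helper, not a SLURM command.
check_node_cpus() {
    awk '
        /^NodeName=/ {
            split("", kv)                       # clear the key/value map for this line
            for (i = 1; i <= NF; i++) {
                n = index($i, "=")
                if (n > 0) kv[substr($i, 1, n - 1)] = substr($i, n + 1)
            }
            # Only check lines that spell out all four values.
            if (("CPUs" in kv) && ("Sockets" in kv) &&
                ("CoresPerSocket" in kv) && ("ThreadsPerCore" in kv)) {
                expected = kv["Sockets"] * kv["CoresPerSocket"] * kv["ThreadsPerCore"]
                if ((kv["CPUs"] + 0) != expected) {
                    printf "%s: CPUs=%s but topology gives %d\n", kv["NodeName"], kv["CPUs"], expected
                    bad = 1
                }
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example call (path taken from this guide):
# check_node_cpus /etc/slurm-llnl/slurm.conf
```

Running slurmd -C on a compute node prints the hardware values SLURM detects for that machine, which can be compared with the NodeName lines.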
1.4 Controller Directories
The controller needs two directories:
- /var/spool/slurmctld --> live controller state, used for recovery after a restart
- /var/log/slurm --> log files, for history and debugging
sudo mkdir -p /var/log/slurm
sudo mkdir -p /var/spool/slurmctld
sudo touch /var/log/slurm/slurmctld.log
sudo chown -R slurm:slurm /var/spool/slurmctld /var/log/slurm
sudo chmod 750 /var/log/slurm /var/spool/slurmctld
sudo chmod 640 /var/log/slurm/slurmctld.log
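To confirm that the ownership and permission changes took effect, something like the following can help. It is a minimal sketch: check_owner_mode is a hypothetical helper name, and stat -c is the GNU coreutils form, which is assumed to be available.

```shell
# Sketch: confirm that a path has the expected owner and permission bits.
# check_owner_mode is a hypothetical helper; stat -c assumes GNU coreutils.
check_owner_mode() {
    local path=$1 want_owner=$2 want_mode=$3
    local got
    got=$(stat -c '%U %a' "$path") || return 1
    if [ "$got" != "$want_owner $want_mode" ]; then
        echo "$path: expected '$want_owner $want_mode', found '$got'"
        return 1
    fi
}

# Example calls on the controller (run as root, because the directories are mode 750):
# check_owner_mode /var/spool/slurmctld slurm 750
# check_owner_mode /var/log/slurm slurm 750
# check_owner_mode /var/log/slurm/slurmctld.log slurm 640
```

The function prints nothing and returns 0 when the path matches, so it can be chained with && in a setup script.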
1.5 Start slurmctld
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
2. Compute Nodes
2.1 Copy Configuration
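The paths in this section assume the compute nodes use /srv/nfsroot on the controller as their NFS root filesystem, so a file copied into that tree is visible to every node.
sudo cp /etc/slurm-llnl/slurm.conf /srv/nfsroot/etc/slurm-llnl/slurm.conf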
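Because slurm.conf must be identical everywhere, it can be worth re-checking the copy whenever the controller's file is edited. The sketch below compares SHA-256 digests of two files; conf_digest and confs_match are hypothetical helper names.

```shell
# Sketch: compare two copies of slurm.conf by SHA-256 digest.
# conf_digest and confs_match are hypothetical helper names.
conf_digest() {
    sha256sum "$1" | cut -d' ' -f1
}

confs_match() {
    [ "$(conf_digest "$1")" = "$(conf_digest "$2")" ]
}

# Example call with the paths used in this guide:
# confs_match /etc/slurm-llnl/slurm.conf /srv/nfsroot/etc/slurm-llnl/slurm.conf \
#     || echo "slurm.conf copies differ"
```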
2.2 Create compute node directories
sudo mkdir -p /srv/nfsroot/var/spool/slurmd
sudo mkdir -p /srv/nfsroot/var/log/slurm
sudo touch /srv/nfsroot/var/log/slurm/slurmd.log
sudo chown -R slurm:slurm /srv/nfsroot/var/spool/slurmd /srv/nfsroot/var/log/slurm
sudo chmod 750 /srv/nfsroot/var/spool/slurmd /srv/nfsroot/var/log/slurm
sudo chmod 640 /srv/nfsroot/var/log/slurm/slurmd.log
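The chown above records the controller's numeric UID for slurm, so the compute nodes need to map that same number to their own slurm user. The sketch below reads the UID from a passwd file; passwd_uid is a hypothetical helper name, and the example assumes the compute nodes take their accounts from /srv/nfsroot/etc/passwd, which this guide does not state.

```shell
# Sketch: print the numeric UID recorded for a user in a given passwd file.
# passwd_uid is a hypothetical helper; it returns non-zero if the user is missing.
passwd_uid() {
    awk -F: -v user="$2" '
        $1 == user { print $3; found = 1 }
        END { exit !found }
    ' "$1"
}

# Example comparison (the nfsroot passwd path is an assumption):
# controller_uid=$(passwd_uid /etc/passwd slurm)
# node_uid=$(passwd_uid /srv/nfsroot/etc/passwd slurm)
# [ "$controller_uid" = "$node_uid" ] || echo "slurm UID differs between controller and nodes"
```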
2.3 Start slurmd
Run these commands on each compute node.
sudo systemctl enable slurmd
sudo systemctl start slurmd
2.4 Verify Cluster
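Run these commands on the controller. The node list is quoted so the shell does not try to expand the brackets.
sudo scontrol update NodeName="node[1-2]" State=RESUME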
sinfo
scontrol show nodes
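For a quick pass/fail view of the sinfo output, a small filter can list the nodes that are not ready for work. This is a sketch: list_unready_nodes is a hypothetical helper name, and treating only idle, allocated, and mixed as "ready" is an assumption. A state with a trailing * (node not responding) is deliberately reported as unready.

```shell
# Sketch: read "node state" pairs on stdin and print nodes that are not ready.
# list_unready_nodes is a hypothetical helper; the set of ready states is an assumption.
list_unready_nodes() {
    awk '$2 != "idle" && $2 != "allocated" && $2 != "mixed" { print $1 }'
}

# Only query the cluster when sinfo is installed.
# -h drops the header, -N prints one line per node, %N is the node name, %T the long state.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -N -o '%N %T' | list_unready_nodes
fi
```

Once both nodes report idle, running srun -N2 hostname is a simple end-to-end test: it should print the hostname of each compute node.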