The SLURM Batch System
The Slurm Workload Manager is an open-source batch system. In contrast to other batch systems, SLURM organizes nodes within partitions. One node can be part of multiple partitions, and the partitions may have different restrictions. It is used by more than 50% of the Top500 supercomputers.
Status of the Supercomputer
SLURM provides two commands to display the current state of your cluster:
- sinfo: shows information on the nodes and available partitions
- squeue: shows information on the currently running and waiting jobs
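As a minimal sketch, the two status queries can be wrapped in one small shell function. The function name `show_cluster_state` is hypothetical and not part of SLURM; `sinfo`, `squeue` and the `squeue -u` user filter are standard SLURM client commands. The sketch assumes it is run on a login node where those tools are installed.

```shell
# Hypothetical helper: print the cluster state in one go.
show_cluster_state() {
    # Bail out early on machines without the SLURM client tools.
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found: is SLURM installed on this machine?" >&2
        return 1
    fi
    sinfo                             # partitions and the state of their nodes
    squeue                            # all running and waiting jobs
    squeue -u "${USER:-$(id -un)}"    # only your own jobs
}

# Usage: show_cluster_state
```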
Batch systems handle “jobs” or “batches”. Usually a job contains a list of the resources it needs and a list of commands which should be executed on these resources. For example, a typical Gromacs simulation needs 16 CPU cores, 32 GB of memory and 2 hours of computation time. The command to start the simulation is “mdrun”.
Starting Interactive Jobs with SLURM
You can now request an interactive session from the batch system. For the SLURM batch system the command would be:
salloc --time=02:00:00 --ntasks=16 --mem=32768
SLURM will now try to allocate the resources you requested. Depending on the utilization of your HPC cluster, this can take anywhere from a few seconds to several days:
salloc: Pending job allocation 17295
salloc: job 17295 queued and waiting for resources
Once the resources are available and allocated to you, SLURM responds with:
salloc: Granted job allocation 17295
You can now run the Gromacs simulation with
mdrun
If the simulation finishes within the requested two hours, you can free the resources simply by typing
exit
salloc: Relinquishing job allocation 17295
salloc: Job allocation 17295 has been revoked.
If the simulation has not finished by then, SLURM will cancel your job shortly after the two-hour limit.
Starting Non-Interactive Jobs with SLURM
If you want to start multiple jobs, or don’t want to wait until resources are free for an interactive job, write job scripts. A simple SLURM job script for the Gromacs simulation above would look like this:
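The following is a minimal sketch of such a script, not a definitive one. It reuses the resource requests from the interactive example (2 hours, 16 tasks, 32768 MB of memory) as `#SBATCH` directives; the job name is an assumption chosen for illustration. The here-document is only a convenient way to create the file from the shell: pasting the lines between the `EOF` markers into an editor works equally well.

```shell
# Write the job script to GromacsSim1.sbatch.
# The quoted 'EOF' keeps the shell from expanding anything inside the script.
cat > GromacsSim1.sbatch <<'EOF'
#!/bin/bash
# Job name shown by squeue (an illustrative choice).
#SBATCH --job-name=GromacsSim1
# Wall-clock limit of two hours.
#SBATCH --time=02:00:00
# 16 tasks, matching the 16 CPU cores of the example.
#SBATCH --ntasks=16
# Memory per node in megabytes (32 GB).
#SBATCH --mem=32768

# Command to start the simulation, as named in the text above.
# A real run would normally pass its input files here.
mdrun
EOF
```

All `#SBATCH` lines have to appear before the first command of the script; SLURM stops reading directives once it reaches an executable line.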
Save it as “GromacsSim1.sbatch” and submit it to the batch system with
sbatch GromacsSim1.sbatch
To see the status of your job, type
squeue
If the job is not listed, it has finished. You should find the log files of the job in the directory you submitted the job from.
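To make it easier to find those log files, the submission step can be sketched as a small helper that records the job ID. Both function names below are hypothetical. The sketch assumes the job script does not set `--output`, in which case SLURM writes the log to `slurm-<job id>.out`; `sbatch --parsable` and `squeue -j` are standard options.

```shell
# Default log file SLURM writes for a job submitted without --output.
log_file_for_job() {
    printf 'slurm-%s.out\n' "$1"
}

# Hypothetical helper: submit a job script and report where its log will appear.
submit_and_report() {
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found: is SLURM installed on this machine?" >&2
        return 1
    fi
    # --parsable prints only the job ID (plus ";cluster" on multi-cluster setups).
    jobid=$(sbatch --parsable "$1") || return 1
    jobid=${jobid%%;*}
    echo "Submitted job $jobid, log file: $(log_file_for_job "$jobid")"
    # Show the current state of just this job.
    squeue -j "$jobid"
}

# Usage: submit_and_report GromacsSim1.sbatch
```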
Stopping SLURM Jobs
To cancel a job, run
scancel <job id>
You can also cancel all jobs of your user with
scancel -u <user name>