The SLURM Batch System

The Slurm Workload Manager (SLURM) is an open-source batch system that is used by more than 50% of the Top500 supercomputers. Unlike some other batch systems, SLURM organizes nodes into partitions. A node can belong to several partitions, and each partition may have its own restrictions.

Status of the Supercomputer

With SLURM you can display the current state of your cluster with:

  • sinfo
    • This shows information on the nodes and available partitions
  • squeue
    • This shows information on the currently running and waiting jobs
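On a busy cluster the full output of both commands can get long. The following is a minimal sketch of how the output could be narrowed down, assuming the standard `--summarize` option of sinfo and the `--user` option of squeue. The helper name `show_status` is made up for this example, and the function only prints a notice on machines where SLURM is not installed.

```shell
#!/bin/bash
# Sketch: a compact cluster overview plus only your own jobs.
# show_status is a hypothetical helper name, not a SLURM command.
show_status() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "SLURM commands not found on this machine"
        return 0
    fi
    sinfo --summarize                       # one summary line per partition
    squeue --user="${USER:-$(id -un)}"      # only jobs of the current user
}

show_status
```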

Jobs

Batch systems handle “jobs” or “batches”. Usually, a job consists of a list of the resources it needs and a list of commands to be executed on these resources. For example, a typical Gromacs simulation might need 16 CPU cores, 32 GB of memory, and 2 hours of computation time. The command to start the simulation is “mdrun”.

Starting Interactive Jobs with SLURM

You can now request an interactive session from the batch system. For the SLURM batch system the command would be:

salloc --time=02:00:00 --ntasks=16 --mem=32768

SLURM will now try to allocate the resources you requested. Depending on the utilization of your HPC cluster, this can take anywhere from a few seconds to several days:

salloc: Pending job allocation 17295
salloc: job 17295 queued and waiting for resources

Once the resources are available and have been allocated to you, SLURM responds with:

salloc: Granted job allocation 17295

You can now run the Gromacs simulation with

mdrun

If the simulation finishes within the requested two hours, you can free the resources simply by typing

exit

salloc: Relinquishing job allocation 17295
salloc: Job allocation 17295 has been revoked.

If the simulation has not finished by then, SLURM will cancel your job shortly after the two-hour deadline.
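The values in the salloc request above follow directly from the example's requirements: 2 hours, 16 tasks, and 32 GB of memory. Since `--mem` is interpreted in megabytes by default, 32 GB becomes 32768. The sketch below shows that arithmetic. `build_salloc_cmd` is a hypothetical helper for illustration, not part of SLURM, and it only prints the command instead of running it.

```shell
#!/bin/bash
# Sketch: derive the salloc request from the example's requirements.
# build_salloc_cmd is a hypothetical helper, not a SLURM command.
build_salloc_cmd() {
    local hours="$1" tasks="$2" mem_gb="$3"
    local mem_mb=$(( mem_gb * 1024 ))   # --mem uses megabytes by default
    printf 'salloc --time=%02d:00:00 --ntasks=%d --mem=%d\n' \
        "$hours" "$tasks" "$mem_mb"
}

# 2 hours, 16 tasks, 32 GB of memory
build_salloc_cmd 2 16 32
# prints: salloc --time=02:00:00 --ntasks=16 --mem=32768
```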

Starting Non-Interactive Jobs with SLURM

If you want to start multiple jobs, or don’t want to wait until resources are free for an interactive job, write job scripts. A simple SLURM job script for the Gromacs simulation above would look like this:

#!/bin/bash
#SBATCH --job-name=GromacsSim1
#SBATCH --time=2:00:00
#SBATCH --ntasks=16
#SBATCH --mem=32768

cd gromacs/sim1/
mdrun

Save it as “GromacsSim1.sbatch” and submit it to the batch system with

sbatch GromacsSim1.sbatch
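Job scripts pay off most when several similar jobs have to be submitted. As a rough sketch, the loop below writes one job script per simulation directory and submits it if sbatch is available. The directory names sim1 to sim3, the file naming scheme, and the helper `write_job_script` are assumptions made for this example.

```shell
#!/bin/bash
# Sketch: write one job script per simulation directory and submit each.
# The directory names sim1..sim3 and write_job_script are hypothetical.
write_job_script() {
    local sim="$1"
    local script="Gromacs-${sim}.sbatch"
    cat > "$script" <<EOF
#!/bin/bash
#SBATCH --job-name=Gromacs-${sim}
#SBATCH --time=2:00:00
#SBATCH --ntasks=16
#SBATCH --mem=32768

cd gromacs/${sim}/
mdrun
EOF
    echo "$script"    # report the file name to the caller
}

for sim in sim1 sim2 sim3; do
    script=$(write_job_script "$sim")
    if command -v sbatch >/dev/null 2>&1; then
        sbatch "$script"
    else
        echo "sbatch not found, wrote $script only"
    fi
done
```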

To see the status of your job, type

squeue

If the job is not listed, it has finished. You should find the job's log files in the directory from which you submitted it.
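To find the right log file later, it helps to keep the job ID at submission time. The sketch below assumes the `--parsable` option of sbatch, which prints the job ID (optionally followed by a semicolon and the cluster name), and SLURM's default output file name slurm-<jobid>.out. The two helper functions are hypothetical names introduced for illustration, and the submission step is skipped on machines without sbatch.

```shell
#!/bin/bash
# Sketch: remember the job ID so the log file can be located later.
# job_id_from_parsable and default_log_name are hypothetical helpers.
job_id_from_parsable() {
    # `sbatch --parsable` prints "<jobid>" or "<jobid>;<cluster>"
    printf '%s\n' "${1%%;*}"
}

default_log_name() {
    # Default output file when the script sets no --output option
    printf 'slurm-%s.out\n' "$1"
}

if command -v sbatch >/dev/null 2>&1; then
    jobid=$(job_id_from_parsable "$(sbatch --parsable GromacsSim1.sbatch)")
    echo "Job $jobid submitted, log file: $(default_log_name "$jobid")"
    squeue --jobs="$jobid"    # show only this job
fi
```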

Stopping SLURM Jobs

To cancel a job, run

scancel <jobid>

You can also cancel all of your own jobs with

scancel -u <user name>
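Sometimes only the jobs that are still waiting should be cancelled, while running ones are left alone. A minimal sketch, assuming the `--state` filter of scancel: the wrapper `cancel_pending`, the `DRY_RUN` variable, and the user name alice are made up for this example. By default the wrapper only prints the command it would run.

```shell
#!/bin/bash
# Sketch: cancel only pending jobs; cancel_pending is a hypothetical helper.
# DRY_RUN defaults to 1, so the command is printed instead of executed.
cancel_pending() {
    local user="${1:-$(id -un)}"
    local cmd=(scancel --user="$user" --state=PENDING)
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: ${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

cancel_pending alice
# prints: would run: scancel --user=alice --state=PENDING
```

Setting DRY_RUN=0 before the call would actually run scancel.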

 
