Batch Systems – Guardians of the Supercomputer

In most cases the nodes of your supercomputer will not be sufficient to run the calculations of all users at the same time. Thus, someone has to schedule the calculations onto the nodes such that all users get a fair share of computation time and the supercomputer is used as efficiently as possible.

If you share your supercomputer with one or two colleagues, you might be able to do this manually by mailing or calling each other. However, for a supercomputer with dozens or even hundreds of users, this is a task to be handled by software: the batch system.

Batch systems handle “jobs” or “batches”. Usually a job contains a list of the resources it needs and a list of commands to be executed on these resources. Most batch systems support interactive and non-interactive jobs. Within an interactive job you start your applications, wait for them to complete and react to the results. A non-interactive job is usually a script containing the commands to be run. The advantage of non-interactive jobs: they can run at night, while you sleep.
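To make the idea of a job more concrete, here is a minimal Python sketch of a job as "resources plus commands" and of a scheduler that starts jobs in submission order while nodes are free. Everything in it is hypothetical and only for illustration: the `Job` class, the `schedule_fifo` function, and the job names and commands are made up. Real batch systems use far more elaborate policies (priorities, fair share, time limits) than this strict first-come, first-served toy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    """A toy job: the resources it asks for plus the commands to run."""
    name: str
    nodes: int                                  # number of nodes requested
    commands: list = field(default_factory=list)


def schedule_fifo(jobs, free_nodes):
    """Start jobs in submission order while enough nodes are free.

    Returns (started, waiting). This is strict FIFO: the first job that
    does not fit blocks every job queued behind it.
    """
    queue = deque(jobs)
    started = []
    while queue and queue[0].nodes <= free_nodes:
        job = queue.popleft()
        free_nodes -= job.nodes
        started.append(job)
    return started, list(queue)


# Hypothetical jobs submitted to an 8-node cluster.
jobs = [
    Job("alice-sim", nodes=4, commands=["./simulate input.dat"]),
    Job("bob-post", nodes=2, commands=["./postprocess results/"]),
    Job("carol-big", nodes=8, commands=["./solver --large"]),
    Job("dave-small", nodes=1, commands=["./analyze"]),
]

started, waiting = schedule_fifo(jobs, free_nodes=8)
print("started:", [j.name for j in started])
print("waiting:", [j.name for j in waiting])
```

Note that the small job at the end of the queue has to wait even though two nodes are idle. Letting such jobs slip into the gaps is one of the ways a scheduler can use the machine more efficiently.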

Besides managing jobs, the batch system can also show you status information about your cluster. Most batch systems can show you how many resources are in use and which are currently free. They can also show you when your jobs are expected to start or finish. If you need to run an interactive job, they can reserve the resources for that run for the next day, if you tell them to.
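How can a batch system predict when a job will start? The following Python sketch shows the basic idea under strong simplifying assumptions: every running job ends exactly after its stated remaining time, and no other job is waiting in the queue. The function `earliest_start` and the numbers in the example are hypothetical and do not correspond to any particular batch system.

```python
def earliest_start(total_nodes, running, nodes_needed):
    """Return the hours from now until `nodes_needed` nodes are free.

    `running` is a list of (nodes, hours_remaining) pairs, one per
    running job. Assumes each job ends exactly at its remaining time
    and that nothing else is queued ahead of us.
    """
    if nodes_needed > total_nodes:
        raise ValueError("job requests more nodes than the cluster has")
    free = total_nodes - sum(nodes for nodes, _ in running)
    if free >= nodes_needed:
        return 0
    # Release nodes in the order in which the running jobs finish.
    for nodes, hours in sorted(running, key=lambda r: r[1]):
        free += nodes
        if free >= nodes_needed:
            return hours
    raise ValueError("running jobs use more nodes than the cluster has")


# Hypothetical 16-node cluster with three running jobs, 2 nodes idle.
running = [(8, 3), (4, 1), (2, 5)]
for need in (2, 6, 10):
    wait = earliest_start(16, running, need)
    print(f"{need} nodes: can start in {wait} h")
```

In practice such estimates shift whenever jobs finish early or other users submit new work, so treat the start times a batch system reports as predictions rather than promises.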

The batch systems most commonly found on clusters today are

  • SLURM
  • PBS / Torque
  • IBM LoadLeveler
  • GridScheduler

Follow the links above for a short introduction to and usage guide for each batch system.

 
