There are two types of supercomputers: those that have a batch system and those that don’t. A batch system starts applications on cluster nodes that have free resources. Without one, you have to search manually for a node that no other user is using before you can start your application. You can of course run your application on a node that another user already occupies, but both your application and theirs will most likely run slower. In the best case both applications still run fine; in the worst case one of them crashes because the node runs out of memory. Batch systems exist to avoid this. Several different batch systems are in use, so you may have to ask your system administrator which one is installed on your cluster, or try a few commands to see which of them work on your supercomputer.
Searching for Free Nodes on a Cluster Without a Batch System
To log in to a node you have to know its name. If you don’t know what the nodes are called, ask your administrator or take a look at the /etc/hosts file:
cat /etc/hosts
This should return a list of IP addresses (the numbers) and the names that belong to them. Ignore the numbers.
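If the hosts file is long, picking out the names by hand gets tedious. The following is a minimal sketch that prints only the names. The function name list_nodes is made up for this example, and it assumes the usual hosts layout with the IP address in the first column and the node name in the second.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the host names from a hosts-format file.
# Reads /etc/hosts unless another file is given as the first argument.
list_nodes() {
    local hosts_file="${1:-/etc/hosts}"
    awk '
        { sub(/#.*/, "") }                         # drop comments
        NF >= 2 && $2 !~ /^(localhost|ip6-)/ {     # skip blank lines and loopback entries
            print $2                               # second column holds the node name
        }
    ' "$hosts_file"
}
```

Calling list_nodes without an argument reads /etc/hosts. Not every cluster lists its nodes in that file, so an empty result does not necessarily mean there are no nodes.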
Now that you know the names of the nodes, check whether you can connect to these nodes and run commands on them:
ssh <node name> uptime
If you can connect and run commands, you should see the output of the “uptime” command. It tells you how long the node has been running and (more importantly) how high the “load” on the node is.
The last three numbers of the output are the load averages for the last 1, 5 and 15 minutes respectively. Each one is the average number of processes that were running or waiting to run during that period, so a value of 1.00 roughly corresponds to one fully used CPU core. The higher the numbers, the busier the node.
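Checking the nodes one by one gets tedious on a larger cluster. As a rough sketch, the loop below runs “uptime” on each node and sorts the nodes by their 1-minute load. The function names load_1min and report_load are hypothetical, and the parsing assumes the Linux output format with a “load average:” label; other systems may print it differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one line of "uptime" output on stdin and
# print the first (1-minute) load average.
load_1min() {
    sed -n 's/.*load average[s]*: *\([0-9.]*\).*/\1/p'
}

# Hypothetical helper: print "<load> <node>" for every node given as an
# argument, lowest load first. BatchMode=yes makes ssh fail instead of
# asking for a password, so nodes without key-based login are skipped.
report_load() {
    local node load
    for node in "$@"; do
        load=$(ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" \
                   'LC_ALL=C uptime' 2>/dev/null | load_1min)
        if [ -n "$load" ]; then
            echo "$load $node"
        fi
    done | sort -n
}

# Example call (the node names are placeholders):
#   report_load node01 node02 node03
```

The node at the top of the list has the lowest load. To judge whether a load is really “low”, compare it with the number of CPU cores of the node.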
If you found a node with a low “load”, log in to this node by running
ssh <node name>
Here you can now start your application as described in Starting Applications on Linux – The Path Matters.
Using a Batch System to Start Your Jobs on Nodes with Free Resources
As shown above, running applications on a cluster without a batch system is cumbersome and can result in overloaded nodes. If your cluster does not have a batch system, ask your administrator to install “Torque”. It is available free of charge for most Linux distributions, usually from their package repositories.
All batch systems provide a command that shows the currently available resources and free nodes. To identify your batch system, try the following commands:
- If “sinfo” returns a list of nodes and their state, you have a SLURM batch system.
- If “qstat -n” returns a list of jobs and the nodes they run on, you have a PBS/Torque batch system (“pbsnodes -a” lists the nodes and their state).
- If “llclass” returns a list of job classes, you have an IBM LoadLeveler batch system (“llstatus” lists the nodes and their state).
- If “qhost” returns a list of nodes and their state, you have a GridScheduler batch system.
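The checks in this list can be sketched as a small shell function. This is only a rough sketch: the function name detect_batch_system is made up, and it tests merely whether a command exists on your PATH, not whether the batch system behind it is actually running. Because Grid Engine also ships a “qstat” command, the sketch looks for “qhost” before it looks for “qstat”.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess the batch system from the commands that
# are available on the PATH. The order matters, because "qstat" exists
# in both PBS/Torque and Grid Engine installations.
detect_batch_system() {
    if   command -v sinfo   >/dev/null 2>&1; then echo "SLURM"
    elif command -v llclass >/dev/null 2>&1; then echo "IBM LoadLeveler"
    elif command -v qhost   >/dev/null 2>&1; then echo "GridScheduler"
    elif command -v qstat   >/dev/null 2>&1; then echo "PBS/Torque"
    else echo "unknown"
    fi
}

detect_batch_system
```

If the result is “unknown”, the commands may simply not be on your PATH (for example because a module has to be loaded first), so it is still worth asking your administrator.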
Use the link for your batch system above to learn how to start applications on free nodes of your cluster.