Last update: 20241120
Monitoring jobs and node configurations on Slurm
Slurm schedules jobs in a fair and orderly manner, taking into account various factors to optimize the use of available resources. If all nodes are fully occupied with long-running jobs, it may not be possible to expedite the scheduling of new jobs. However, such situations are relatively rare, as Slurm’s scheduling algorithms are designed to maximize efficiency and minimize wait times. This ensures that all users get fair access to the resources in a timely manner.
To effectively manage and monitor jobs on a Slurm cluster, it’s important to know a few basic commands that provide insights into job status, queue details, and node configurations. Here are two essential commands:
Commands

- squeue: displays information about jobs that are queued and running.
- sinfo: shows the status of the partitions and nodes in the cluster.
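As a quick illustration, a basic status check might look like the sketch below (the --me shorthand is only available on recent Slurm releases; -u $USER works everywhere):

```bash
# Show all jobs currently queued or running on the cluster
squeue

# Show only your own jobs
squeue -u $USER      # classic form
squeue --me          # shorthand on recent Slurm releases

# Show the state of every partition and its nodes
sinfo
```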
Understanding squeue output

The following table explains the output columns of the squeue command:
| Column | Description |
|---|---|
| JOBID | Unique identifier for each job |
| PARTITION | The partition (queue) the job is running on |
| NAME | The name of the job |
| USER | The username of the job owner |
| ST | The state of the job (R for running, PD for pending) |
| TIME | The time the job has been running or waiting |
| NODES | The number of nodes being used |
| NODELIST(REASON) | The specific nodes being used, or the reason the job is pending |
Example squeue output
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
98467 all-nodes dbnsfp username1 R 7:23 1 tenant-name-slurm-compute-dynamic-13
98458 dynamic-8 sys/dash username1 R 3:06:49 1 tenant-name-slurm-compute-dynamic-01
98464 dynamic-8 sys/dash username2 PD 0:00 1 (Resources)
98460 dynamic-8 sys/dash username2 R 2:18:27 1 tenant-name-slurm-compute-dynamic-04
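To focus on part of such a listing, squeue accepts filter options. A few examples, reusing job ID 98464 from the output above:

```bash
# Show only pending jobs, together with the reason they are waiting
squeue -t PD

# Follow a single job by its JOBID
squeue -j 98464

# Ask the scheduler for an estimated start time of a pending job
squeue --start -j 98464
```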
Understanding sinfo output

The following table explains the output columns of the sinfo command:
| Column | Description |
|---|---|
| PARTITION | Name of the partition |
| AVAIL | Availability of the partition (up for available, inact for inactive) |
| TIMELIMIT | Maximum time that jobs can run in this partition (usually infinite) |
| NODES | Number of nodes available or in use in the partition |
| STATE | Current state of the nodes: alloc (all nodes allocated to jobs), idle (nodes idle and available; a ~ suffix means the node is powered down and will resume when a job is scheduled on it), down (nodes down and not available for jobs), mix (nodes partially allocated, with some resources still free) |
| NODELIST | Specific nodes assigned to the partition |
Example sinfo output
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
static inact infinite 1 down tenant-name-slurm-compute-template
dynamic-8cores-16g* up infinite 4 alloc tenant-name-slurm-compute-dynamic-[01-04]
dynamic-16cores-32g up infinite 4 idle~ tenant-name-slurm-compute-dynamic-[05-08]
dynamic-16cores-64g up infinite 4 idle~ tenant-name-slurm-compute-dynamic-[09-12]
dynamic-16cores-128g up infinite 1 idle~ tenant-name-slurm-compute-dynamic-14
dynamic-16cores-128g up infinite 1 mix tenant-name-slurm-compute-dynamic-13
dynamic-a100gpu-128cores-900g-4gpus up infinite 1 idle~ tenant-name-slurm-compute-dynamic-gpu-01
all-nodes-cpu up infinite 9 idle~ tenant-name-slurm-compute-dynamic-[05-12,14]
all-nodes-cpu up infinite 1 mix tenant-name-slurm-compute-dynamic-13
all-nodes-cpu up infinite 4 alloc tenant-name-slurm-compute-dynamic-[01-04]
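Since the partition names above encode the node sizes, it can also help to ask sinfo for the actual per-node resources instead of relying on the names. A sketch using sinfo's format options:

```bash
# One line per partition: availability, time limit, node count, CPUs and memory (MB) per node, GRES
sinfo -o "%.30P %.5a %.10l %.6D %.4c %.8m %.15G"

# One line per node, in long format, with detailed state information
sinfo -N -l
```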
By regularly using squeue and sinfo, users can manage their jobs more effectively and plan their resource usage according to the availability and current load of the compute cluster.
Some nodes show 32G of memory on their partition but will not run jobs that request more than #SBATCH --mem 28G, since part of the memory is reserved as system overhead. Keep this in mind for other types of overhead that might prevent a job from launching.
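For example, a job script targeting one of the 32G nodes above could request memory conservatively. This is only a sketch: the partition name is taken from the example sinfo output, and my_analysis.sh is a placeholder for your own workload.

```bash
#!/bin/bash
#SBATCH --job-name=example-job
#SBATCH --partition=dynamic-16cores-32g
#SBATCH --cpus-per-task=4
#SBATCH --mem=28G            # stay below the nominal 32G to leave room for system overhead
#SBATCH --time=01:00:00

srun ./my_analysis.sh        # placeholder for the actual workload
```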
Note on resource requests in Slurm
When submitting jobs to Slurm, it is crucial to ensure that your resource requests match what is actually available on the system. Requesting resources that exceed the system's capabilities, such as asking for 500GB of memory on a node that offers significantly less, may leave your job hanging indefinitely without launching and without any failure notice. To avoid these issues, use the sinfo command regularly to verify the available resources and configurations on the cluster. This careful checking helps ensure that your job submissions are compatible with the system's capabilities, preventing unnecessary delays. In such cases, the NODELIST(REASON) column of squeue will typically show (Resources).
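One way to verify the real, usable limits before submitting is to query a node, or an already stuck job, directly with scontrol. The node and job IDs below are taken from the example outputs above:

```bash
# Show a node's full configuration, including RealMemory and CPUTot
scontrol show node tenant-name-slurm-compute-dynamic-13

# For a job that is stuck pending, show the scheduler's reason
scontrol show job 98464 | grep -i reason
```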