Since upgrading slurm to 25.05.0 (22.05.9 -> 23.11.11 -> 25.05.0), some jobs requesting --gres=gpu:reqcount GPUs are allocated fewer than reqcount GPUs if some of the node's GPUs are already in use by other jobs.
We have a node - let's call it ares-c02-06 - with 2 GPUs. Consider the following test script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1-00:00:00

echo "CUDA_VISIBLE_DEVICES: " $CUDA_VISIBLE_DEVICES
echo "SLURM_JOB_GPUS: " $SLURM_JOB_GPUS
echo "SLURM_GPUS_ON_NODE: " $SLURM_GPUS_ON_NODE
sleep 10d
Submit a job to the node:
sbatch *--gres=gpu:1* --nodelist=ares-c02-06 job.sh
Submitted batch job 1950559
The job starts. Now submit the script again, asking for 2 GPUs:
sbatch *--gres=gpu:2* --nodelist=ares-c02-06 job.sh
Submitted batch job 1950567
This second job should not start, as the resources are not available.
Surprisingly, _both jobs are running_:
$ squeue -w ares-c02-06
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 1950567     gpuai   job.sh jan.gmys  R  1:55     1 ares-c02-06
 1950559     gpuai   job.sh jan.gmys  R  2:58     1 ares-c02-06
For the second job, 1950567, AllocTRES shows gres/gpu=1 instead of the requested gres/gpu=2:
# sacct -j 1950567,1950559 -X -o jobid%10,reqtres%45,alloctres%60
     JobID                                     ReqTRES                                                    AllocTRES
   1950559    billing=1,cpu=1,gres/gpu=1,mem=4G,node=1      billing=1,cpu=1,gres/gpu:l40s=1,gres/gpu=1,mem=4G,node=1
   1950567    billing=1,cpu=1,*gres/gpu=2*,mem=4G,node=1    billing=1,cpu=1,gres/gpu:l40s=1,*gres/gpu=1*,mem=4G,node=1
The output of both jobs:

$ cat slurm-1950559.out
CUDA_VISIBLE_DEVICES: 0
SLURM_JOB_GPUS: 0
SLURM_GPUS_ON_NODE: 1

$ cat slurm-1950567.out
CUDA_VISIBLE_DEVICES: 0
SLURM_JOB_GPUS: 1
SLURM_GPUS_ON_NODE: 1
CUDA_VISIBLE_DEVICES is set to 0 for both jobs; SLURM_JOB_GPUS is 0 for the first job and 1 for the second.
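The controller-side view should tell the same story: the device index Slurm thinks it allocated appears (if I read the output correctly) in the GRES=...(IDX:...) part of the per-node lines in the detailed job output, e.g.:

$ scontrol -d show job 1950559 | grep -i gres
$ scontrol -d show job 1950567 | grep -i gres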
*Environment:*
- RHEL 9.1
- slurm 25.05.0
- The GRES configuration seems fine, AutoDetect is off (a matching static gres.conf line is sketched after the node listing below):
# /usr/sbin/slurmd -G --conf-server hpc-slurm.cluster.hpc -v
[2025-07-22T16:44:05.548] GRES: Global *AutoDetect=off*(4)
[2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia0 major 195, minor 0
[2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia1 major 195, minor 1
[2025-07-22T16:44:05.548] GRES: gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2025-07-22T16:44:05.548] GRES: gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2025-07-22T16:44:05.548] Gres Name=gpu Type=L40S Count=2 Index=0 ID=7696487 File=/dev/nvidia[0-1] Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
and 'scontrol show node':

NodeName=ares-c02-06 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:L40S:2
   NodeAddr=ares-c02-06 NodeHostName=ares-c02-06 Version=25.05.0
   OS=Linux 5.14.0-162.6.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 30 07:36:03 EDT 2022
   RealMemory=386000 AllocMem=0 FreeMem=363069 Sockets=2 Boards=1
   State=IDLE+RESERVED+PLANNED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpuai
   BootTime=2025-07-04T10:26:27 SlurmdStartTime=2025-07-22T14:48:54
   LastBusyTime=2025-07-22T16:26:36 ResumeAfterTime=None
   CfgTRES=cpu=48,mem=386000M,billing=48,gres/gpu=2
   AllocTRES=
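For reference (a sketch only, not a verbatim copy of our configuration): with AutoDetect off, the slurmd -G output above corresponds to a plain static gres.conf entry along these lines, with Type and File exactly as reported:

NodeName=ares-c02-06 Name=gpu Type=L40S File=/dev/nvidia[0-1]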
*Debug notes:*
- When using the --gpus option instead of --gres, everything works as expected: the second job is PENDING (Resources)
- Tried ConstrainDevices both on and off in cgroup.conf; same result
- The same is happening on other multi-GPU nodes of the cluster
- When the --gres=gpu:2 job is submitted first, i.e. when all GPUs are taken, the second (--gres=gpu:1) job waits correctly.
- When both GPUs are free, a --gres=gpu:2 job correctly gets both GPUs: CUDA_VISIBLE_DEVICES: 0,1
- It worked in slurm 22.05.9 (we recently upgraded in two steps: 22.05.9 -> 23.11.11 -> 25.05.0)
- The only viable workaround I see for the moment is to intercept --gres (and --gpus-per-node, which I don't think is even exposed in job_desc!! :-/) in job_submit.lua and force users to use the --gpus option, which seems to work fine; a rough sketch is below.
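Something along these lines, assuming that a --gres=gpu:N request shows up in job_desc.tres_per_node (the exact string format seems to vary between versions, so I'd log it first); it simply rejects GPU requests made via --gres and asks users to resubmit with --gpus:

-- job_submit.lua (sketch, untested)
-- Assumption: --gres=gpu:N is visible as job_desc.tres_per_node; the exact
-- format ("gres/gpu:2", "gres:gpu:2", ...) may differ between Slurm versions.
function slurm_job_submit(job_desc, part_list, submit_uid)
    local per_node = job_desc.tres_per_node
    if per_node ~= nil and string.find(per_node, "gpu") then
        slurm.log_user("GPU requests via --gres are temporarily disabled " ..
                       "(GRES allocation problem after the 25.05.0 upgrade); " ..
                       "please resubmit using --gpus=<n>.")
        -- slurm.ERROR gives a generic denial; a more specific ESLURM_* code
        -- could be returned instead if it is exported to the Lua environment.
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end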
Anyone experienced similar issues?
Any idea how to solve this would be highly appreciated.
Jan