Since upgrading slurm to 25.05.0 (22.05.9 -> 23.11.11 -> 25.05.0), some jobs requesting --gres=gpu:reqcount GPUs are allocated fewer than reqcount GPUs if some of the node's GPUs are already in use by other jobs.
We have a node - let's call it ares-c02-06 - with 2 GPUs. Consider the following test script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1-00:00:00

echo "CUDA_VISIBLE_DEVICES: " $CUDA_VISIBLE_DEVICES
echo "SLURM_JOB_GPUS: " $SLURM_JOB_GPUS
echo "SLURM_GPUS_ON_NODE: " $SLURM_GPUS_ON_NODE
sleep 10d
Submit a job to the node:
sbatch *--gres=gpu:1* --nodelist=ares-c02-06 job.sh
Submitted batch job 1950559
The job starts. Now submit the script again, asking for 2 GPUs:
sbatch *--gres=gpu:2* --nodelist=ares-c02-06 job.sh
Submitted batch job 1950567
This second job should not start, as the resources are not available.
Surprisingly, _both jobs are running_:
$ squeue -w ares-c02-06
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 1950567     gpuai   job.sh jan.gmys  R  1:55     1 ares-c02-06
 1950559     gpuai   job.sh jan.gmys  R  2:58     1 ares-c02-06
For the second job, 1950567, AllocTRES shows gres/gpu=1 instead of the requested gres/gpu=2:
# sacct -j 1950567,1950559 -X -o jobid%10,reqtres%45,alloctres%60
     JobID                                     ReqTRES                                                    AllocTRES
   1950559    billing=1,cpu=1,gres/gpu=1,mem=4G,node=1      billing=1,cpu=1,gres/gpu:l40s=1,gres/gpu=1,mem=4G,node=1
   1950567    billing=1,cpu=1,*gres/gpu=2*,mem=4G,node=1    billing=1,cpu=1,gres/gpu:l40s=1,*gres/gpu=1*,mem=4G,node=1
The output of both jobs:

$ cat slurm-1950559.out
CUDA_VISIBLE_DEVICES: 0
SLURM_JOB_GPUS: 0
SLURM_GPUS_ON_NODE: 1

$ cat slurm-1950567.out
CUDA_VISIBLE_DEVICES: 0
SLURM_JOB_GPUS: 1
SLURM_GPUS_ON_NODE: 1
CUDA_VISIBLE_DEVICES is set to 0 for both jobs; SLURM_JOB_GPUS is 0 for the first job and 1 for the second.
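The controller-side view should tell the same story: the device index Slurm thinks it allocated appears (if I read the output correctly) in the GRES=...(IDX:...) part of the per-node lines in the detailed job output, e.g.:

$ scontrol -d show job 1950559 | grep -i gres
$ scontrol -d show job 1950567 | grep -i gres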
*Environment:*
- RHEL 9.1
- slurm 25.05.0
- The GRES configuration seems fine, AutoDetect is off (a matching static gres.conf line is sketched after the node listing below):
# /usr/sbin/slurmd -G --conf-server hpc-slurm.cluster.hpc -v
[2025-07-22T16:44:05.548] GRES: Global *AutoDetect=off*(4)
[2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia0 major 195, minor 0
[2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia1 major 195, minor 1
[2025-07-22T16:44:05.548] GRES: gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2025-07-22T16:44:05.548] GRES: gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2025-07-22T16:44:05.548] Gres Name=gpu Type=L40S Count=2 Index=0 ID=7696487 File=/dev/nvidia[0-1] Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
and 'scontrol show node':

NodeName=ares-c02-06 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:L40S:2
   NodeAddr=ares-c02-06 NodeHostName=ares-c02-06 Version=25.05.0
   OS=Linux 5.14.0-162.6.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 30 07:36:03 EDT 2022
   RealMemory=386000 AllocMem=0 FreeMem=363069 Sockets=2 Boards=1
   State=IDLE+RESERVED+PLANNED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpuai
   BootTime=2025-07-04T10:26:27 SlurmdStartTime=2025-07-22T14:48:54
   LastBusyTime=2025-07-22T16:26:36 ResumeAfterTime=None
   CfgTRES=cpu=48,mem=386000M,billing=48,gres/gpu=2
   AllocTRES=
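For reference (a sketch only, not a verbatim copy of our configuration): with AutoDetect off, the slurmd -G output above corresponds to a plain static gres.conf entry along these lines, with Type and File exactly as reported:

NodeName=ares-c02-06 Name=gpu Type=L40S File=/dev/nvidia[0-1]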
*Debug notes:*
- When using the --gpus option instead of --gres, everything works as expected: the second job is PENDING (Resources)
- Tried ConstrainDevices both on and off in cgroup.conf; same result
- The same is happening on other multi-GPU nodes of the cluster
- When the --gres=gpu:2 job is submitted first, i.e. when all GPUs are taken, the second (--gres=gpu:1) job waits correctly.
- When both GPUs are free, a --gres=gpu:2 job correctly gets both GPUs: CUDA_VISIBLE_DEVICES: 0,1
- It worked in slurm 22.05.9 (we recently upgraded in two steps: 22.05.9 -> 23.11.11 -> 25.05.0)
- The only viable workaround I see for the moment is to intercept --gres (and --gpus-per-node, which I don't think is even exposed in job_desc!! :-/) in job_submit.lua and force users to use the --gpus option, which seems to work fine; a rough sketch is below.
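Something along these lines, assuming that a --gres=gpu:N request shows up in job_desc.tres_per_node (the exact string format seems to vary between versions, so I'd log it first); it simply rejects GPU requests made via --gres and asks users to resubmit with --gpus:

-- job_submit.lua (sketch, untested)
-- Assumption: --gres=gpu:N is visible as job_desc.tres_per_node; the exact
-- format ("gres/gpu:2", "gres:gpu:2", ...) may differ between Slurm versions.
function slurm_job_submit(job_desc, part_list, submit_uid)
    local per_node = job_desc.tres_per_node
    if per_node ~= nil and string.find(per_node, "gpu") then
        slurm.log_user("GPU requests via --gres are temporarily disabled " ..
                       "(GRES allocation problem after the 25.05.0 upgrade); " ..
                       "please resubmit using --gpus=<n>.")
        -- slurm.ERROR gives a generic denial; a more specific ESLURM_* code
        -- could be returned instead if it is exported to the Lua environment.
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end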
Anyone experienced similar issues?
Any idea how to solve this would be highly appreciated.
Jan