Since upgrading slurm to 25.05.0 (22.05.9 -> 23.11.11 -> 25.05.0), some jobs requesting --gres=gpu:<reqcount> are allocated fewer than <reqcount> GPUs if some of the node's GPUs are already in use by other jobs.

We have a node - let's call it ares-c02-06 - with 2 GPUs. Consider the following test script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1-00:00:00

echo "CUDA_VISIBLE_DEVICES: " $CUDA_VISIBLE_DEVICES
echo "SLURM_JOB_GPUS: " $SLURM_JOB_GPUS
echo "SLURM_GPUS_ON_NODE: " $SLURM_GPUS_ON_NODE
sleep 10d



Submit a job to the node:

sbatch --gres=gpu:1 --nodelist=ares-c02-06 job.sh

Submitted batch job 1950559

The job starts. Now submit the script again, asking for 2 GPUs:

sbatch --gres=gpu:2 --nodelist=ares-c02-06 job.sh

Submitted batch job 1950567
This second job should not start, as the resources are not available.

Surprisingly, both jobs are running:

$ squeue -w ares-c02-06
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1950567     gpuai   job.sh jan.gmys  R       1:55      1 ares-c02-06
           1950559     gpuai   job.sh jan.gmys  R       2:58      1 ares-c02-06

In the second job (1950567), AllocTRES shows gres/gpu=1 instead of the requested gres/gpu=2:

# sacct -j 1950567,1950559 -X -o jobid%10,reqtres%45,alloctres%60
     JobID                                       ReqTRES                                 AllocTRES 
---------- --------------------------------------------- ------------------------------------------------------------ 
   1950559      billing=1,cpu=1,gres/gpu=1,mem=4G,node=1 billing=1,cpu=1,gres/gpu:l40s=1,gres/gpu=1,mem=4G,node=1 
   1950567      billing=1,cpu=1,gres/gpu=2,mem=4G,node=1 billing=1,cpu=1,gres/gpu:l40s=1,gres/gpu=1,mem=4G,node=1 

The output of both jobs:

$ cat slurm-1950559.out 
CUDA_VISIBLE_DEVICES:  0
SLURM_JOB_GPUS:  0
SLURM_GPUS_ON_NODE:  1

$ cat slurm-1950567.out 
CUDA_VISIBLE_DEVICES:  0
SLURM_JOB_GPUS:  1
SLURM_GPUS_ON_NODE:  1

CUDA_VISIBLE_DEVICES is set to 0 for both jobs, while SLURM_JOB_GPUS is 0 and 1, respectively.


Environment:

- RHEL 9.1

- slurm 25.05.0

- The GRES configuration seems fine and AutoDetect is off:

# /usr/sbin/slurmd -G --conf-server hpc-slurm.cluster.hpc -v
[2025-07-22T16:44:05.548] GRES: Global AutoDetect=off(4)
[2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia0 major 195, minor 0
[2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia1 major 195, minor 1
[2025-07-22T16:44:05.548] GRES: gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2025-07-22T16:44:05.548] GRES: gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2025-07-22T16:44:05.548] Gres Name=gpu Type=L40S Count=2 Index=0 ID=7696487 File=/dev/nvidia[0-1] Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT

and the output of 'scontrol show node':

NodeName=ares-c02-06 Arch=x86_64 CoresPerSocket=24 
   CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:L40S:2
   NodeAddr=ares-c02-06 NodeHostName=ares-c02-06 Version=25.05.0
   OS=Linux 5.14.0-162.6.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 30 07:36:03 EDT 2022
   RealMemory=386000 AllocMem=0 FreeMem=363069 Sockets=2 Boards=1
   State=IDLE+RESERVED+PLANNED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpuai 
   BootTime=2025-07-04T10:26:27 SlurmdStartTime=2025-07-22T14:48:54
   LastBusyTime=2025-07-22T16:26:36 ResumeAfterTime=None
   CfgTRES=cpu=48,mem=386000M,billing=48,gres/gpu=2
   AllocTRES=


Debug notes:

- When using the --gpus option instead of --gres, everything works as expected: the second job stays PENDING (Resources) (example commands below).

- Tried ConstrainDevices both on and off in cgroup.conf; same result.

- The same is happening on other multi-GPU nodes of the cluster

- When the --gres=gpu:2 job is submitted first, i.e. when all GPUs are taken, the second (--gres=gpu:1) job waits correctly. 

- When both GPUs are free, the --gres=gpu:2 job correctly gets both GPUs: CUDA_VISIBLE_DEVICES: 0,1

- It worked in slurm 22.05.9 (we recently upgraded in two steps: 22.05.9 -> 23.11.11 -> 25.05.0).

- The only viable workaround I see for the moment is to intercept the --gres and --gpus-per-node options in job_submit.lua (I don't think the latter even appears in job_desc! :-/) and force users to use the --gpus option, which seems to work fine (a rough sketch is included below).
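
For reference, the equivalent --gpus submissions would be the following (same node and script as above); with these, the second job correctly stays PENDING (Resources):

sbatch --gpus=1 --nodelist=ares-c02-06 job.sh
sbatch --gpus=2 --nodelist=ares-c02-06 job.sh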

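And here is the kind of job_submit.lua filter I have in mind. It is only a minimal sketch, assuming that on 25.05 a --gres request shows up in job_desc.tres_per_node; I have not verified the exact string format there (e.g. 'gres/gpu:2' vs 'gpu:2') or whether --gpus-per-node lands in the same field, so the match is deliberately loose:

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Assumption: --gres=gpu:N (and possibly --gpus-per-node) populate
    -- job_desc.tres_per_node, while --gpus goes to tres_per_job and passes through.
    local tpn = job_desc.tres_per_node
    if tpn ~= nil and string.find(tpn, "gpu") then
        slurm.log_user("GPU requests via --gres are temporarily blocked due to a GRES " ..
                       "scheduling problem; please request GPUs with --gpus=N instead.")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end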

Has anyone experienced similar issues?

Any ideas on how to solve this would be highly appreciated.


Jan





-- 
Jan Gmys
Research Engineer
HPC/AI support for the MesoNET platform
Mésocentre de Calcul Scientifique Intensif de l'Université de Lille