Since upgrading Slurm to 25.05.0 (22.05.9 -> 23.11.11 -> 25.05.0), some
jobs requesting --gres=gpu:<reqcount> are allocated fewer than
<reqcount> GPUs if some of the node's GPUs are already in use by other jobs.
We have a node - let's call it ares-c02-06 - with 2 GPUs. Consider the 
following test script:
> #!/bin/bash
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=1
> #SBATCH --time=1-00:00:00
>
> echo "CUDA_VISIBLE_DEVICES: " $CUDA_VISIBLE_DEVICES
> echo "SLURM_JOB_GPUS: " $SLURM_JOB_GPUS
> echo "SLURM_GPUS_ON_NODE: " $SLURM_GPUS_ON_NODE
> sleep 10d
Submit a job to the node:
>
> sbatch *--gres=gpu:1* --nodelist=ares-c02-06 job.sh
>
> Submitted batch job 1950559
>
The job starts. Now submit the script again, asking for 2 GPUs:
> sbatch *--gres=gpu:2* --nodelist=ares-c02-06 job.sh
>
> Submitted batch job 1950567
This second job should not start, as the resources are not available.
Surprisingly, _both jobs are running_:
> $ squeue -w ares-c02-06
>               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>             1950567     gpuai   job.sh jan.gmys  R       1:55      1 ares-c02-06
>             1950559     gpuai   job.sh jan.gmys  R       2:58      1 ares-c02-06
For the second job, 1950567, AllocTRES shows gres/gpu=1 instead of the
requested gres/gpu=2:
> # sacct -j 1950567,1950559 -X -o jobid%10,reqtres%45,alloctres%60
>       JobID                                       ReqTRES                                 AllocTRES
> ---------- --------------------------------------------- ------------------------------------------------------------
>     1950559      billing=1,cpu=1,gres/gpu=1,mem=4G,node=1 billing=1,cpu=1,gres/gpu:l40s=1,gres/gpu=1,mem=4G,node=1
>     1950567      billing=1,cpu=1,*gres/gpu=2*,mem=4G,node=1 billing=1,cpu=1,gres/gpu:l40s=1,*gres/gpu=1*,mem=4G,node=1
The output of both jobs:
> $ cat slurm-1950559.out
> CUDA_VISIBLE_DEVICES:  0
> SLURM_JOB_GPUS:  0
> SLURM_GPUS_ON_NODE:  1
>
> $ cat slurm-1950567.out
> CUDA_VISIBLE_DEVICES:  0
> SLURM_JOB_GPUS:  1
> SLURM_GPUS_ON_NODE:  1
CUDA_VISIBLE_DEVICES is set to 0 for both jobs; SLURM_JOB_GPUS is 0 and 1, respectively.
*Environment:*
- RHEL 9.1
- slurm 25.05.0
- The GRES configuration seems fine; AutoDetect is off:
> # /usr/sbin/slurmd -G --conf-server hpc-slurm.cluster.hpc -v
> [2025-07-22T16:44:05.548] GRES: Global *AutoDetect=off*(4)
> [2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia0 major 195,
> minor 0
> [2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia1 major 195,
> minor 1
> [2025-07-22T16:44:05.548] GRES: gpu device number 0(/dev/nvidia0):c 195:0 rwm
> [2025-07-22T16:44:05.548] GRES: gpu device number 1(/dev/nvidia1):c 195:1 rwm
> [2025-07-22T16:44:05.548] Gres Name=gpu Type=L40S Count=2 Index=0 ID=7696487
> File=/dev/nvidia[0-1] Links=(null)
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
and the output of 'scontrol show node':
> NodeName=ares-c02-06 Arch=x86_64 CoresPerSocket=24
>     CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.00
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=gpu:L40S:2
>     NodeAddr=ares-c02-06 NodeHostName=ares-c02-06 Version=25.05.0
>     OS=Linux 5.14.0-162.6.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 30
> 07:36:03 EDT 2022
>     RealMemory=386000 AllocMem=0 FreeMem=363069 Sockets=2 Boards=1
>     State=IDLE+RESERVED+PLANNED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>     Partitions=gpuai
>     BootTime=2025-07-04T10:26:27 SlurmdStartTime=2025-07-22T14:48:54
>     LastBusyTime=2025-07-22T16:26:36 ResumeAfterTime=None
>     CfgTRES=cpu=48,mem=386000M,billing=48,gres/gpu=2
>     AllocTRES=
*Debug notes:*
- When using the --gpus option instead of --gres, everything works as
expected: the second job is PENDING (Resources).
- Tried ConstrainDevices both on and off in cgroup.conf; same result.
- The same is happening on other multi-GPU nodes of the cluster
- When the --gres=gpu:2 job is submitted first, i.e. when all GPUs are 
taken, the second (--gres=gpu:1) job waits correctly.
- When both GPUs are free, the --gres=gpu:2 job correctly gets both
GPUs: CUDA_VISIBLE_DEVICES: 0,1
- It worked in Slurm 22.05.9 (we recently upgraded in two steps:
22.05.9 -> 23.11.11 -> 25.05.0).
- The only viable workaround I see for the moment is to intercept the
--gres and --gpus-per-node options in job_submit.lua (I don't think the
latter even shows up as its own field in job_desc! :-/) and force users
to use the --gpus option, which seems to work fine; a rough sketch of
what I mean is below.
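For reference, the kind of job_submit.lua rule I have in mind is the
untested sketch below. It assumes that both --gres=gpu:N and
--gpus-per-node=N end up in job_desc.tres_per_node, and that the exact
string stored there ("gpu:2" vs "gres/gpu:2") can vary between
versions, hence the loose substring match. It simply rejects such
requests and points users at --gpus.
> function slurm_job_submit(job_desc, part_list, submit_uid)
>    -- Assumption: --gres=gpu:N and --gpus-per-node=N both land in
>    -- tres_per_node; the string format differs across versions, so
>    -- match loosely on "gpu".
>    local tpn = job_desc.tres_per_node
>    if tpn ~= nil and string.find(tpn, "gpu") then
>       slurm.log_user("Please request GPUs with --gpus=N instead of "
>                      .. "--gres=gpu:N or --gpus-per-node=N")
>       return slurm.ERROR
>    end
>    return slurm.SUCCESS
> end
>
> function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
>    return slurm.SUCCESS
> end
It is a blunt rule, but until the GRES allocation issue is understood
it would at least keep --gres jobs from silently getting fewer GPUs
than requested.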
Has anyone experienced similar issues?
Any idea how to solve this would be highly appreciated.
Jan
-- 
Jan Gmys
Research Engineer
HPC/AI support for the MesoNET platform
Mésocentre de Calcul Scientifique Intensif de l'Université de Lille