For the past three days, on our NVIDIA DGX A100 systems running Ubuntu 22.04.5 and Slurm 24.11.5, we have had jobs that request a GPU and get started by Slurm, but are never given a GPU and then fail.
In the slurmctld log we see a line like:
[2025-07-22T02:46:29.697] error: gres/gpu: job 6919154 node A100-04 no resources selected
In the slurmd log I see no errors for the job, but there is a line like
[2025-07-22T02:46:29.757] [6919154.extern] task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 195:0 rwm(/dev/nvidia0)
for all 8 of the GPUs on the node.
Other jobs still seem to start up and get a GPU fine.
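To correlate the two logs, something like the following could pull out the affected jobs. It is only a sketch: the regular expressions are modeled on the two log lines quoted above (other Slurm versions may word them differently), and the function names find_unallocated and denied_devices are made up for illustration.

```python
import re

# Patterns modeled on the two log lines quoted in this post (assumption:
# the wording is the same for every affected job).
NO_RES = re.compile(r"error: gres/gpu: job (\d+) node (\S+) no resources selected")
DENY = re.compile(
    r"\[(\d+)\.extern\] task/cgroup: _handle_device_access: "
    r"GRES: job devices\.deny: adding c \d+:\d+ rwm\(([^)]+)\)"
)


def find_unallocated(ctld_lines):
    """Return (job_id, node) pairs that slurmctld reported with no GPU selected."""
    hits = []
    for line in ctld_lines:
        m = NO_RES.search(line)
        if m:
            hits.append((int(m.group(1)), m.group(2)))
    return hits


def denied_devices(slurmd_lines, job_id):
    """Return the device paths that slurmd denied to the given job's extern step."""
    devices = []
    for line in slurmd_lines:
        m = DENY.search(line)
        if m and int(m.group(1)) == job_id:
            devices.append(m.group(2))
    return devices


ctld = ["[2025-07-22T02:46:29.697] error: gres/gpu: job 6919154 node A100-04 no resources selected"]
slurmd = ["[2025-07-22T02:46:29.757] [6919154.extern] task/cgroup: _handle_device_access: "
          "GRES: job devices.deny: adding c 195:0 rwm(/dev/nvidia0)"]

for job_id, node in find_unallocated(ctld):
    print(job_id, node, denied_devices(slurmd, job_id))
```

A job that was denied all 8 devices on the node would show 8 paths in the last column.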
The job stats show:
  ReqTRES   : billing=7,cpu=1,gres/gpu=1,mem=96G,node=1
  AllocTRES : billing=3,cpu=1,mem=96G,node=1
so even though the GPU was requested, it was not allocated.
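A quick way to spot such jobs in bulk is to compare the two TRES strings. The sketch below is an illustration, not a tested tool: parse_tres and missing_gres are hypothetical helper names, and it assumes the strings have the comma-separated name=value form shown above (for example as reported by the ReqTRES and AllocTRES fields of sacct).

```python
def parse_tres(tres):
    """Split a TRES string such as 'cpu=1,mem=96G' into a dict of name -> value."""
    result = {}
    for item in tres.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        result[name] = value
    return result


def missing_gres(req_tres, alloc_tres):
    """Return the gres/* entries that were requested but are absent from the allocation."""
    req = parse_tres(req_tres)
    alloc = parse_tres(alloc_tres)
    return sorted(name for name in req if name.startswith("gres/") and name not in alloc)


# Values copied from the job stats quoted in this post.
req = "billing=7,cpu=1,gres/gpu=1,mem=96G,node=1"
alloc = "billing=3,cpu=1,mem=96G,node=1"
print(missing_gres(req, alloc))  # ['gres/gpu']
```

An empty list would mean every requested GRES also appears in the allocation.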
Occasionally on these boxes (and only these; my Dell Rocky 8 boxes with GPUs have no problem) we see the nodes go into drain state with the "gres/gpu GRES core specification ... doesn't match socket boundaries." message (see https://support.schedmd.com/show_bug.cgi?id=22498). This does seem to happen after a slurmctld restart.
Whenever that happens, I restart slurmd on the nodes and can then resume them in Slurm.
Otherwise, nothing has changed in the Slurm config or the OS on these nodes in over a month.
The node definition is
  NodeName=A100-[01-04] \
      CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 \
      ThreadsPerCore=1 RealMemory=1031000 MemSpecLimit=2048 \
      TmpDisk=1400000 Feature=amd,epyc,a100 \
      Gres=gpu:a100-sxm4-40gb:8
and gres.conf on the nodes is simply
  AutoDetect=nvml
---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129            USA