[slurm-users] Job not running with Resource Reason even though resources appear to be available

Tue Jan 26 21:00:06 UTC 2021

While doing more investigation I found an interesting situation.

I have a 32 core (2 x 16 core Xeon) node with 10 RTX cards where
all 10 cards have affinity to just one socket (cores 0-15 as shown
by 'nvidia-smi topo -m').  The jobs currently running on it are
using 5 GPUs and 15 cores:

# scontrol show node=rtx-04 | grep gres
    CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
    AllocTRES=cpu=15,mem=220G,gres/gpu=5
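
The headroom implied by those two lines can be worked out mechanically.
A minimal sketch, assuming the two TRES lines are pasted in as strings;
the helper names parse_tres and free_count are hypothetical, and only the
integer-valued entries (cpu, gres/gpu) are compared because mem is
reported in different units on each line:

```python
# The two lines reported by 'scontrol show node=rtx-04' above.
CFG = "CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10"
ALLOC = "AllocTRES=cpu=15,mem=220G,gres/gpu=5"


def parse_tres(line):
    """Turn 'XxxTRES=k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    # Drop the leading 'CfgTRES=' / 'AllocTRES=' label.
    _, _, body = line.strip().partition("=")
    return dict(item.split("=", 1) for item in body.split(",") if item)


def free_count(cfg, alloc, key):
    """Configured minus allocated for an integer-valued TRES such as cpu."""
    return int(cfg[key]) - int(alloc.get(key, 0))


cfg, alloc = parse_tres(CFG), parse_tres(ALLOC)
print("free cpu:", free_count(cfg, alloc, "cpu"))
print("free gpu:", free_count(cfg, alloc, "gres/gpu"))
```

With the values above this leaves 17 cores and 5 GPUs unallocated, which
is why the node looks like it has room for another GPU job.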

Checking /sys/fs/cgroup I see these jobs are using cores 0-14:

# grep . /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38409/cpuset.cpus:12-14
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38670/cpuset.cpus:0-2
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38673/cpuset.cpus:3-5
/sys/fs/cgroup/cpuset/slurm/uid_5829/job_49088/cpuset.cpus:9-11
/sys/fs/cgroup/cpuset/slurm/uid_8285/job_49048/cpuset.cpus:6-8
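
To double check which cores on the GPU socket are still free, the cpuset
listing can be expanded into a set of core numbers.  This is only a
sketch, assuming the grep output is pasted in as a string; the helper
names expand_cpuset and used_cores are hypothetical, and the 0-15 range
is the affinity reported by 'nvidia-smi topo -m':

```python
# Output of the grep above, pasted verbatim (one 'path:cpulist' per line).
CPUSET_LINES = """\
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38409/cpuset.cpus:12-14
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38670/cpuset.cpus:0-2
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38673/cpuset.cpus:3-5
/sys/fs/cgroup/cpuset/slurm/uid_5829/job_49088/cpuset.cpus:9-11
/sys/fs/cgroup/cpuset/slurm/uid_8285/job_49048/cpuset.cpus:6-8
"""


def expand_cpuset(spec):
    """Expand a cpuset list such as '0-2,5' into a set of core numbers."""
    cores = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single core ('5') has no upper bound, so reuse the lower one.
        cores.update(range(int(lo), int(hi or lo) + 1))
    return cores


def used_cores(text):
    """Union of the core lists that follow the last ':' on each line."""
    used = set()
    for line in text.splitlines():
        if line.strip():
            used |= expand_cpuset(line.rsplit(":", 1)[1])
    return used


used = used_cores(CPUSET_LINES)
gpu_socket = set(range(16))  # cores 0-15, per 'nvidia-smi topo -m'
free_on_gpu_socket = sorted(gpu_socket - used)
print("used:", sorted(used))
print("free on GPU socket:", free_on_gpu_socket)
```

For the listing above the only free core on the GPU socket is core 15.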

If I submit a job to rtx-04 asking for 1 core and 1 GPU, the job runs
without a problem and uses core 15.  If I then submit more jobs asking
for a GPU, they run fine on cores 16 and up.

Now if I cancel my jobs so I am back to the jobs using 5 GPUs and 15 cores
and then submit a job asking for 2 cores and 1 GPU, the job
stays in the Pending state and refuses to run on rtx-04.

So before submitting any bug report I decided to upgrade to the latest SLURM 
version.  I upgraded from 20.02.03 to 20.11.3 (with those jobs still running 
on rtx-04) and now the problem has gone away.  I can submit a 2 core and 1 GPU 
job and it runs immediately.

So my problem seems fixed, but during the update I noticed something weird.
SLURM now insists that Cores in gres.conf be set to Cores=0-31
even though 'nvidia-smi topo -m' still says 0-15.  I decided to just remove
the Cores= setting from /etc/slurm/gres.conf.

So before the update slurmd.log has:

[2021-01-26T03:07:45.673] Gres Name=gpu Type=quadro_rtx_8000 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-15 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0

and after the update

[2021-01-26T14:31:47.282] Gres Name=gpu Type=quadro_rtx_8000 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0

This is fine with me as I want SLURM to ignore GPU affinity on these nodes
but it is curious.
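
The two slurmd.log entries can be compared field by field to confirm that
Cores is the only thing that changed.  A minimal sketch, assuming each
entry is a flat run of key=value fields as in the lines quoted above; the
helper names gres_fields and changed_fields are hypothetical:

```python
import re

# The two 'Gres' entries from slurmd.log, each joined onto one line.
BEFORE = (
    "[2021-01-26T03:07:45.673] Gres Name=gpu Type=quadro_rtx_8000 Count=1 "
    "Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-15 CoreCnt=32 "
    "Links=-1,0,0,0,0,0,2,0,0,0"
)
AFTER = (
    "[2021-01-26T14:31:47.282] Gres Name=gpu Type=quadro_rtx_8000 Count=1 "
    "Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 "
    "Links=-1,0,0,0,0,0,2,0,0,0"
)


def gres_fields(entry):
    """Return the key=value fields of one log entry as a dict."""
    # The timestamp contains no '=', so it is skipped automatically.
    return dict(re.findall(r"(\w+)=(\S+)", entry))


def changed_fields(old, new):
    """Map each field whose value differs to its (old, new) pair."""
    a, b = gres_fields(old), gres_fields(new)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


print(changed_fields(BEFORE, AFTER))
```

For these two entries the only differing field is Cores, which goes from
0-15 to 0-31 while CoreCnt and Links stay the same.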

-- Paul Raines (http://help.nmr.mgh.harvard.edu)

On Mon, 25 Jan 2021 10:07am, Paul Raines wrote:

>
> I tried submitting jobs with --gres-flags=disable-binding but
> this has not made any difference.  Jobs asking for GPUs are still only
> being run if a core defined in gres.conf for the GPU is free.
>
> Basically it seems the option is ignored.
>
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Sun, 24 Jan 2021 11:39am, Paul Raines wrote:
>
>>  Thanks Chris.
>>
>>  I think you have identified the issue here or are very close.  My
>>  gres.conf on the rtx-04 node for example is:
>>
>>  AutoDetect=nvml
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8 Cores=0-15
>>  Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9 Cores=0-15
>>
>>  There are 32 cores (HT is off).  But the daughter card that holds all
>>  10 of the RTX8000s connects to only one socket as can be seen from
>>  'nvidia-smi topo -m'
>>
>>  It's odd though in that my tests on my identically configured
>>  rtx6000 partition did not show that behavior but maybe it is
>>  due to just the "random" cores that got assigned to jobs there
>>  all having at least one core on the "right" socket.
>>
>>  Anyway, how do I turn off this "affinity enforcement" as it is
>>  more important that a job run with a GPU on its non-affinity socket
>>  than just wait and not run at all?
>>
>>  Thanks
>>
>>  -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>> 
>> 
>>
>>  On Sat, 23 Jan 2021 3:19pm, Chris Samuel wrote:
>>
>>>   On Saturday, 23 January 2021 9:54:11 AM PST Paul Raines wrote:
>>>
>>>>   Now rtx-08 which has only 4 GPUs seems to always get all 4 used.
>>>>   But the others seem to always only get half used (except rtx-07
>>>>   which somehow gets 6 used so another weird thing).
>>>>
>>>>   Again if I submit non-GPU jobs, they end up allocating all the
>>>>   cores/cpus on the nodes just fine.
>>>
>>>   What does your gres.conf look like for these nodes?
>>>
>>>   One thing I've seen in the past is where the core specifications for
>>>   the GPUs are out of step with the hardware and so Slurm thinks they're
>>>   on the wrong socket.  Then when all the cores in that socket are used
>>>   up Slurm won't put more GPU jobs on the node without the jobs
>>>   explicitly asking to not do locality.
>>>
>>>   One thing I've noticed is that prior to Slurm 20.02 the documentation
>>>   for gres.conf used to say:
>>> 
>>> #   If your cores contain multiple threads only the first thread
>>> #   (processing unit) of each core needs to be listed.
>>>
>>>   but that language is gone from 20.02 and later and the change isn't
>>>   mentioned in the release notes for 20.02 so I'm not sure what happened
>>>   there, the only clue is this commit:
>>>
>>>   https://github.com/SchedMD/slurm/commit/7461b6ba95bb8ae70b36425f2c7e4961ac35799e#diff-cac030b65a8fc86123176971a94062fafb262cb2b11b3e90d6cc69e353e3bb89
>>>
>>>   which says "xcpuinfo_abs_to_mac() expects a core list, not a CPU list."
>>>
>>>   Best of luck!
>>>   Chris
>>>   --
>>>    Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>>