[slurm-users] Job not running with Resource Reason even though resources appear to be available

Chris Samuel chris at csamuel.org
Sat Jan 23 20:19:02 UTC 2021


On Saturday, 23 January 2021 9:54:11 AM PST Paul Raines wrote:

> Now rtx-08 which has only 4 GPUs seems to always get all 4 used.
> But the others seem to always only get half used (except rtx-07
> which somehow gets 6 used, so another weird thing).
> 
> Again if I submit non-GPU jobs, they end up allocating all the
> cores/cpus on the nodes just fine.

What does your gres.conf look like for these nodes?

One thing I've seen in the past is the core specifications for the GPUs 
being out of step with the hardware, so Slurm thinks the GPUs are on the 
wrong socket.  Then, once all the cores in that socket are used up, Slurm 
won't put more GPU jobs on the node unless the jobs explicitly ask it not 
to enforce locality.
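
As a rough illustration of that check (a sketch, not anything taken from 
Slurm itself): the snippet below expands a Cores= style range from gres.conf 
and reports which socket(s) those cores land on.  It assumes cores are 
numbered contiguously per socket (socket 0 holds cores 0 to N-1, and so on); 
the function names and the 16-cores-per-socket figure are made up for the 
example.  The result can then be compared by hand against the socket the 
hardware reports for each GPU.

```python
def expand_ranges(spec):
    """Expand a range list such as "0-3,8,10-11" into a sorted list of ints."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            ids.update(range(int(lo), int(hi) + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


def sockets_for_cores(spec, cores_per_socket):
    """Return the sockets touched by a core range list.

    ASSUMPTION: cores are numbered contiguously per socket, so the socket
    index is simply core_id // cores_per_socket.
    """
    return sorted({core // cores_per_socket for core in expand_ranges(spec)})


if __name__ == "__main__":
    # Hypothetical node: 2 sockets with 16 cores each.
    for spec in ("0-15", "16-31", "8-23"):
        print(spec, "->", sockets_for_cores(spec, 16))
```

A range that spans both sockets, or sits on the other socket from the one 
the GPU is really attached to, would be the sort of mismatch described above.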

Another thing I've noticed is that, prior to Slurm 20.02, the documentation 
for gres.conf used to say:

# If your cores contain multiple threads only the first thread
# (processing unit) of each core needs to be listed.

but that language is gone from 20.02 and later, and the change isn't 
mentioned in the release notes for 20.02, so I'm not sure what happened 
there.  The only clue is this commit:

https://github.com/SchedMD/slurm/commit/7461b6ba95bb8ae70b36425f2c7e4961ac35799e#diff-cac030b65a8fc86123176971a94062fafb262cb2b11b3e90d6cc69e353e3bb89

which says "xcpuinfo_abs_to_mac() expects a core list, not a CPU list."
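
If a gres.conf was written with CPU (hardware thread) IDs rather than core 
IDs, a conversion along these lines might help when rewriting it as a core 
list.  This is only a sketch: it assumes the threads of a core have adjacent 
IDs, so that core = CPU ID // threads per core, which may not match how a 
given node enumerates its CPUs.  The helper names are hypothetical.

```python
def cpus_to_cores(cpu_ids, threads_per_core):
    """Map CPU (thread) IDs to the sorted set of core IDs that contain them.

    ASSUMPTION: threads of the same core have adjacent IDs, so
    core_id = cpu_id // threads_per_core.
    """
    return sorted({cpu // threads_per_core for cpu in cpu_ids})


def to_range_string(ids):
    """Compress a collection of ints into a range list such as "0-2,5,7-8"."""
    parts = []
    start = prev = None
    for i in sorted(set(ids)):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = i
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


if __name__ == "__main__":
    # Hypothetical node with 2 threads per core: CPUs 0-15 cover cores 0-7.
    print(to_range_string(cpus_to_cores(range(0, 16), 2)))
```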

Best of luck!
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA





