[slurm-users] Questions about dynamic nodes
Kevin Buckley
Kevin.Buckley at pawsey.org.au
Wed Sep 28 04:46:58 UTC 2022
On 2022/09/27 23:26, Groner, Rob wrote:
> I have 2 nodes that offer a "gc" feature. Node t-gc-1202 is "normal", and node t-gc-1201 is dynamic.
> I can successfully remove t-gc-1201 and bring it back dynamically. Once I bring it back, that node
> appears JUST LIKE the "normal" node in the sinfo output, as seen here:
>
> [rug262 at testsch (RC) slurm] sinfo -o "%20N %10c %10m %25f %10G "
> NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
> t-sc-[1101-1104]     48         358400     nogpu,sc                  (null)
> t-gc-1201            48         385420     gpu,gc,a100               gpu:2(S:0-
> t-gc-1202            48         358400     gpu,gc,a100               gpu:2
> t-ic-1051            36         500000     ic,a40                    (null)
>
> When I execute a job requiring 24 CPUs and the gc feature, then it runs on t-gc-1202 only.
> If I sbatch 3 of the same jobs at once, then 2 run on t-gc-1202 and the 3rd is pending for
> resources.
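You say the third job is pending for resources: the exact Reason
squeue reports may still help narrow things down, e.g. (the format
string here is only an example; %R prints the reason a job is
pending):

  squeue -t PD -o "%.10i %.9P %.8T %.30R"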
I'd always assumed Features were a "boolean indicator": for
scheduling purposes, a node either has a feature or it doesn't.
The behaviour you are seeing, though, suggests that Slurm may be
"counting" TWO nodes as having the feature, and then giving up
after it has scheduled TWO jobs, which, unless some other
countable resource is being exhausted by the two running jobs,
seems wrong.
So, what happens if you sbatch 3 of the same jobs, each asking
for 16 CPUs and the gc feature?
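Something along these lines, say (the sleep payload and exact
option spellings are just placeholders, adjust to match your
real jobs):

  for i in 1 2 3; do
    sbatch --constraint=gc --ntasks=1 --cpus-per-task=16 \
           --wrap="sleep 600"
  done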
If all three start on t-gc-1202, then there'd seem to be
something broken in, and tied to, the "bringing t-gc-1201 back
dynamically" step; but if only two start, both on t-gc-1202,
then that points towards the total number of nodes carrying the
gc feature being counted, rather than the feature being treated
as a simple boolean, or towards some other countable resource
being exhausted.
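It might also be worth comparing what the controller thinks the
two nodes look like once t-gc-1201 has come back dynamically, in
case some attribute differs from the statically-defined node:

  scontrol show node t-gc-1201
  scontrol show node t-gc-1202

and eyeballing the State, AvailableFeatures and Gres lines in
particular.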