[slurm-users] Questions about dynamic nodes

Kevin Buckley Kevin.Buckley at pawsey.org.au
Wed Sep 28 04:46:58 UTC 2022


On 2022/09/27 23:26, Groner, Rob wrote:
> I have 2 nodes that offer a "gc" feature.  Node t-gc-1202 is "normal", and node t-gc-1201 is dynamic.
> I can successfully remove t-gc-1201 and bring it back dynamically.  Once I bring it back, that node
> appears JUST LIKE the "normal" node in the sinfo output, as seen here:
> 
> [rug262 at testsch (RC) slurm] sinfo -o "%20N  %10c  %10m  %25f  %10G "
> NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
> t-sc-[1101-1104]      48          358400      nogpu,sc                   (null)
> t-gc-1201             48          385420      gpu,gc,a100                gpu:2(S:0-
> t-gc-1202             48          358400      gpu,gc,a100                gpu:2
> t-ic-1051             36          500000      ic,a40                     (null)
> 
> When I execute a job requiring 24 CPUs and the gc feature, then it runs on t-gc-1202 only.
> If I sbatch 3 of the same jobs at once, then 2 run on t-gc-1202 and the 3rd is pending for
> resources.

I'd always assumed Features were a "boolean indicator", as in a node
either has it or it doesn't, for scheduling purposes, but the
behaviour you are seeing suggests that Slurm MAY be "counting"
TWO nodes as having the feature, and then giving up after it has
scheduled TWO jobs, which, modulo some other countable resource
being exhausted by the two running jobs, seems wrong.
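For reference, the distinction I had in mind (just a sketch with
generic sbatch options, not your actual submission):

  # a feature is (supposedly) a boolean match against each node:
  sbatch -C gc ...             # "run on a node that HAS the gc feature"

  # whereas a GRES really is a countable, consumable resource:
  sbatch --gres=gpu:2 ...      # "run on a node with 2 free GPUs"

so a feature should never "run out", however many jobs request it.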

So, what happens if you sbatch 3 of the same jobs, each asking
for 16 CPUs and the gc feature?
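Something along these lines, say (an untested sketch: the script
name, time limit and sleep are placeholders, only the -C gc and
the 16 CPUs matter, and three 16-CPU jobs would all fit onto one
of your 48-CPU nodes):

  #!/bin/bash
  #SBATCH --constraint=gc      # ask for the gc feature
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=16   # 3 x 16 = 48, so all three COULD share one node
  #SBATCH --time=00:05:00
  srun sleep 120

  $ for i in 1 2 3 ; do sbatch test-gc.sh ; done
  $ squeue -w t-gc-1201,t-gc-1202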

If all three start on t-gc-1202, then there'd seem to be
something screwed up and tied into the "bringing t-gc-1201 back
dynamically", but if only two start, and both on t-gc-1202,
then it points towards the total number of nodes with the gc
feature being counted, rather than the feature being treated as
a boolean, or some other countable resource being exhausted.
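Either way, it would be worth comparing what slurmctld thinks of
the two nodes, and why the third job is left pending (standard
commands, with <jobid> obviously a placeholder):

  scontrol show node t-gc-1201   # the dynamic one
  scontrol show node t-gc-1202   # the "normal" one
  # look for differences in State, AvailableFeatures/ActiveFeatures,
  # CfgTRES and Partitions between the two

  squeue -j <jobid> -o "%i %T %R"   # %R shows the pending Reason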




