[slurm-users] odd binding interaction with hint=nomultithread
Henderson, Brent
brent.henderson at hpe.com
Mon Aug 8 15:11:38 UTC 2022
I've hit an issue with binding using Slurm 21.08.5 that I'm hoping someone might be able to help with. I scanned the mailing list archive but didn't see this one - apologies if I missed it. Maybe I just need a better understanding of why this is happening, but it feels like a bug.
The issue is that passing --hint=nomultithread to an salloc (or sbatch) seems to break the binding for the srun run within it. It works fine if it is a direct srun.
Here are examples of running the sruns directly, where things look good:
~> srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
cpu-bind=MAP - cn4, task 0 0 [103837]: mask 0x1 set
cpu-bind=MAP - cn4, task 1 1 [103838]: mask 0x10000 set
cpu-bind=MAP - cn4, task 2 2 [103839]: mask 0x100000000 set
cpu-bind=MAP - cn4, task 3 3 [103840]: mask 0x1000000000000 set
~> srun --hint=nomultithread -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
cpu-bind=MAP - cn4, task 0 0 [103992]: mask 0x1 set
cpu-bind=MAP - cn4, task 1 1 [103993]: mask 0x10000 set
cpu-bind=MAP - cn4, task 2 2 [103994]: mask 0x100000000 set
cpu-bind=MAP - cn4, task 3 3 [103995]: mask 0x1000000000000 set
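Decoding those masks out of curiosity confirms they line up with the map_cpu list - a quick Python sketch, assuming the usual convention that bit N of the mask corresponds to logical CPU N:

# Turn a cpu-bind mask (as printed with --cpu_bind=v) back into CPU IDs.
def mask_to_cpus(mask_hex):
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]

for m in ("0x1", "0x10000", "0x100000000", "0x1000000000000"):
    print(m, "->", mask_to_cpus(m))
# -> [0], [16], [32], [48], i.e. exactly map_cpu:0,16,32,48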
And here are the sruns wrapped by an salloc:
~> salloc --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
salloc: Granted job allocation 282077
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MAP - cn4, task 0 0 [169441]: mask 0x1 set
cpu-bind=MAP - cn4, task 1 1 [169442]: mask 0x10000 set
cpu-bind=MAP - cn4, task 2 2 [169443]: mask 0x100000000 set
cpu-bind=MAP - cn4, task 3 3 [169444]: mask 0x1000000000000 set
salloc: Relinquishing job allocation 282077
~> salloc --hint=nomultithread --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
salloc: Granted job allocation 282078
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task 0 0 [169586]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 1 1 [169587]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 2 2 [169588]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 3 3 [169589]: mask 0xf0000000000000000000000000000000f set
salloc: Relinquishing job allocation 282078
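Running the same decode over the mask that shows up once the hint is on the salloc (mask string copied verbatim from the output above; again assuming bit N is logical CPU N):

# 8 bits are set here - presumably the whole 4-core allocation plus the
# cores' hyperthread siblings - rather than the single CPU I mapped each
# task to.  The exact IDs depend on this node's CPU numbering.
mask = int("0xf0000000000000000000000000000000f", 16)
print([cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1])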
I do see that the reported binding has changed from cpu-bind=MAP to cpu-bind=MASK - maybe that is a clue. :) Even if I pass in a mask explicitly, it is not fully honored when the hint is present:
~> salloc --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,mask_cpu:0x1,0x1000,0x100000000,0x1000000000000 /bin/true
salloc: Granted job allocation 282084
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task 0 0 [125303]: mask 0x1 set
cpu-bind=MASK - cn4, task 1 1 [125304]: mask 0x1000 set
cpu-bind=MASK - cn4, task 2 2 [125305]: mask 0x100000000 set
cpu-bind=MASK - cn4, task 3 3 [125306]: mask 0x1000000000000 set
salloc: Relinquishing job allocation 282084
~> salloc --hint=nomultithread --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,mask_cpu:0x1,0x1000,0x100000000,0x1000000000000 /bin/true
salloc: Granted job allocation 282085
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task 0 0 [125462]: mask 0x1 set
cpu-bind=MASK - cn4, task 1 1 [125463]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 2 2 [125464]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 3 3 [125465]: mask 0xf0000000000000000000000000000000f set
salloc: Relinquishing job allocation 282085
Note that the mask is ignored for tasks 1, 2, and 3 in this latter case. I'm fairly sure my syntax is correct, since it worked in the first test without the hint. I also have 22.05.0 installed but not active; I'll try it with that later today and report the results.
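For completeness, here is the sort of thing I could run in place of /bin/true to see the affinity the kernel actually applies to each task - a small sketch (the script name is arbitrary, and SLURM_PROCID is only used to label the output):

# print_affinity.py - print each task's effective CPU affinity, e.g.:
#   srun -n 4 -N 1 --ntasks-per-node=4 \
#        --cpu_bind=v,map_cpu:0,16,32,48 python3 print_affinity.py
import os, socket

rank = os.environ.get("SLURM_PROCID", "?")   # task rank within the step
cpus = sorted(os.sched_getaffinity(0))       # CPUs this process may run on
print(f"{socket.gethostname()} task {rank}: allowed CPUs {cpus}")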
Brent