[slurm-users] odd binding interaction with hint=nomultithread

Henderson, Brent brent.henderson at hpe.com
Mon Aug 8 15:11:38 UTC 2022


I've hit an issue with CPU binding using Slurm 21.08.5 that I'm hoping someone might be able to help with.  I took a scan through the e-mail list but didn't see this one - apologies if I missed it.  Maybe I just need a better understanding of why this is happening, but it feels like a bug.

The issue is that if I include --hint=nomultithread on an salloc (or sbatch), it seems to break the binding for the srun run within it.  It works fine if it is a direct srun.

Here are examples of running the sruns directly, where things look good:

~> srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
cpu-bind=MAP  - cn4, task  0  0 [103837]: mask 0x1 set
cpu-bind=MAP  - cn4, task  1  1 [103838]: mask 0x10000 set
cpu-bind=MAP  - cn4, task  2  2 [103839]: mask 0x100000000 set
cpu-bind=MAP  - cn4, task  3  3 [103840]: mask 0x1000000000000 set

~> srun --hint=nomultithread -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
cpu-bind=MAP  - cn4, task  0  0 [103992]: mask 0x1 set
cpu-bind=MAP  - cn4, task  1  1 [103993]: mask 0x10000 set
cpu-bind=MAP  - cn4, task  2  2 [103994]: mask 0x100000000 set
cpu-bind=MAP  - cn4, task  3  3 [103995]: mask 0x1000000000000 set
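
As a sanity check on those masks (nothing Slurm-specific, just looking at which bit is set in each hex value), a quick one-liner confirms they correspond to logical CPUs 0, 16, 32 and 48 as requested:

~> python3 -c 'print([m.bit_length() - 1 for m in (0x1, 0x10000, 0x100000000, 0x1000000000000)])'
[0, 16, 32, 48]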

And here are the sruns wrapped by an salloc:

~> salloc --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
salloc: Granted job allocation 282077
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MAP  - cn4, task  0  0 [169441]: mask 0x1 set
cpu-bind=MAP  - cn4, task  1  1 [169442]: mask 0x10000 set
cpu-bind=MAP  - cn4, task  2  2 [169443]: mask 0x100000000 set
cpu-bind=MAP  - cn4, task  3  3 [169444]: mask 0x1000000000000 set
salloc: Relinquishing job allocation 282077

~> salloc --hint=nomultithread --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
salloc: Granted job allocation 282078
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task  0  0 [169586]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  1  1 [169587]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  2  2 [169588]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  3  3 [169589]: mask 0xf0000000000000000000000000000000f set
salloc: Relinquishing job allocation 282078
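
To make sense of that mask (again just decoding bit positions, same idea as above), something like this works:

~> python3 -c 'm = int("f0000000000000000000000000000000f", 16); print([i for i in range(m.bit_length()) if m >> i & 1])'

The low f is logical CPUs 0-3 and the high f is another block of four logical CPUs at the top end of the node, so all four tasks end up sharing the same small set of CPUs instead of getting 0, 16, 32 and 48.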

I do see that the binding has changed to cpu-bind=MASK.  Maybe that is a clue.  :)  Even if I pass in a mask myself, it is not fully honored in the presence of the hint:

~> salloc --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,mask_cpu:0x1,0x1000,0x100000000,0x1000000000000 /bin/true
salloc: Granted job allocation 282084
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task  0  0 [125303]: mask 0x1 set
cpu-bind=MASK - cn4, task  1  1 [125304]: mask 0x1000 set
cpu-bind=MASK - cn4, task  2  2 [125305]: mask 0x100000000 set
cpu-bind=MASK - cn4, task  3  3 [125306]: mask 0x1000000000000 set
salloc: Relinquishing job allocation 282084

~> salloc --hint=nomultithread --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,mask_cpu:0x1,0x1000,0x100000000,0x1000000000000 /bin/true
salloc: Granted job allocation 282085
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task  0  0 [125462]: mask 0x1 set
cpu-bind=MASK - cn4, task  1  1 [125463]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  2  2 [125464]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  3  3 [125465]: mask 0xf0000000000000000000000000000000f set
salloc: Relinquishing job allocation 282085

Note that the mask is ignored for tasks 1, 2, and 3 in this latter case.  I'm pretty sure my syntax is correct, since it worked in the first test without the hint.  I also have 22.05.0 installed but not active.  I'll try it with that later today and report the results.
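
One more thing I want to rule out (this is just a guess on my part, not something I've verified): whether salloc exports the hint into the allocation environment, e.g. as SLURM_HINT, which the inner srun would then pick up as an input variable and let it override the step's --cpu_bind.  Something along these lines should show whatever gets propagated:

~> salloc --hint=nomultithread --exclusive -N 1 -n 4 bash -c 'env | grep -E "SLURM_(HINT|CPU_BIND)"'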

Brent
