[slurm-users] Regression with srun and task/affinity
    Jason Bacon 
    bacon4000 at gmail.com
       
    Sun Dec 16 08:02:25 MST 2018
    
    
  
Good morning,
We've been running 17.02.11 for a long time, and while testing an
upgrade to the 18 series we discovered a regression.  It was introduced
somewhere between 17.02.11 and 17.11.7.
Everything works fine under 17.02.11.
Under later versions, srun fails with TaskPlugin=task/affinity.
Everything is fine if I don't use srun or if I use TaskPlugin=task/none.
Just wondering if someone can suggest where to look in the source code 
for this. If I can just pinpoint where the problem is, I'm sure I can 
come up with a solution pretty quickly.  I've poked around a bit but 
have not spotted anything yet.  If this doesn't look familiar to anyone, 
I'll dig deeper and figure it out eventually. Just don't want to 
duplicate someone's effort if this is something that's been fixed 
already on other platforms.
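To confirm which task plugin a given node is actually configured with before comparing results, a minimal sketch along these lines may help. The default path in CONF and the function name task_plugin are assumptions for illustration, not something taken from the setup described here.

```shell
#!/bin/sh
# Print the TaskPlugin value(s) set in a slurm.conf file.
# CONF is a hypothetical default path; adjust for the local installation.
CONF="${1:-/usr/local/etc/slurm.conf}"

task_plugin() {
    # Skip commented-out lines, match the key case-insensitively,
    # and print only the value (without spaces or trailing comments).
    sed -n 's/^[[:space:]]*[Tt][Aa][Ss][Kk][Pp][Ll][Uu][Gg][Ii][Nn][[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1"
}

# Only run against the default file if it is readable.
if [ -r "$CONF" ]; then
    task_plugin "$CONF"
fi
```

On a node where srun fails this would be expected to print task/affinity, and task/none on a node using the workaround.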
Below is the output from a failed srun, the tail of the slurmctld log,
and the output from successful sbatch --array and openmpi jobs.
Thanks,
     Jason
Failing job:
FreeBSD login.wren  bacon ~ 474: srun hostname
srun: error: slurm_receive_msgs: Zero Bytes were transmitted or received
srun: error: Task launch for 82.0 failed on node compute-001: Zero Bytes were transmitted or received
srun: error: Application launch failed: Zero Bytes were transmitted or received
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
Tail of slurmctld log:
[2018-12-01T16:29:09.873] debug2: got 1 threads to send out
[2018-12-01T16:29:09.874] debug2: Tree head got back 0 looking for 2
[2018-12-01T16:29:09.874] debug3: Tree sending to compute-001
[2018-12-01T16:29:09.874] debug3: Tree sending to compute-002
[2018-12-01T16:29:09.874] debug2: slurm_connect failed: Connection refused
[2018-12-01T16:29:09.874] debug2: Error connecting slurm stream socket at 192.168.1.13:6818: Connection refused
[2018-12-01T16:29:09.874] debug3: connect refused, retrying
[2018-12-01T16:29:09.874] debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
[2018-12-01T16:29:10.087] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2018-12-01T16:29:10.087] debug2: Tree head got back 1
[2018-12-01T16:29:10.087] debug2: _slurm_rpc_node_registration complete for compute-002 usec=97
[2018-12-01T16:29:10.917] debug2: slurm_connect failed: Connection refused
[2018-12-01T16:29:10.917] debug2: Error connecting slurm stream socket at 192.168.1.13:6818: Connection refused
[2018-12-01T16:29:11.976] debug2: slurm_connect failed: Connection refused
[2018-12-01T16:29:11.976] debug2: Error connecting slurm stream socket at 192.168.1.13:6818: Connection refused
[2018-12-01T16:29:12.007] debug2: Testing job time limits and checkpoints
[2018-12-01T16:29:13.011] debug2: slurm_connect failed: Connection refused
[2018-12-01T16:29:13.011] debug2: Error connecting slurm stream socket at 192.168.1.13:6818: Connection refused
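The repeated "Connection refused" entries in the log above are easier to read when tallied per endpoint. The following is a rough sketch, not an established tool: the default path in LOG and the function name count_refused are assumptions, and the sed pattern simply mirrors the message format seen in this excerpt.

```shell
#!/bin/sh
# Count "Connection refused" socket errors per address:port in a slurmctld log.
# LOG is a hypothetical default path; adjust for the local installation.
LOG="${1:-/var/log/slurmctld.log}"

count_refused() {
    # Step 1 (awk): rejoin continuation lines, i.e. lines that do not start
    #               with '[', onto the preceding log entry.
    # Step 2 (sed): keep only the address:port of refused socket connections.
    # Step 3:       tally identical endpoints.
    awk '
        /^\[/ { if (entry != "") print entry; entry = $0; next }
              { entry = entry " " $0 }
        END   { if (entry != "") print entry }
    ' "$1" |
    sed -n 's/.*Error connecting slurm stream socket *at \([0-9.]*:[0-9]*\): Connection refused.*/\1/p' |
    sort | uniq -c
}

# Only run against the default file if it is readable.
if [ -r "$LOG" ]; then
    count_refused "$LOG"
fi
```

Run against the excerpt above, this reports four refused attempts, all to 192.168.1.13:6818, which suggests that nothing was accepting connections on that node's port 6818 at that point.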
Successful sbatch --array:
#!/bin/sh -e
#SBATCH --array=1-8
hostname
FreeBSD login.wren  bacon ~ 462: more slurm-69_8.out
cpu-bind=MASK - compute-002, task  0  0 [64261]: mask 0x8 set
compute-002.wren
Successful openmpi:
#!/bin/sh -e
#SBATCH --ntasks=8
mpirun --report-bindings ./mpi-bench 3
FreeBSD login.wren  bacon ~/Data/mpi-bench/trunk 468: more slurm-81.out
cpu-bind=MASK - compute-001, task  0  0 [64589]: mask 0xf set
CPU 0 is set
CPU 1 is set
CPU 2 is set
CPU 3 is set
CPU 0 is set
CPU 1 is set
[compute-001.wren:64590] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[compute-001.wren:64590] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
[compute-001.wren:64590] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[compute-001.wren:64590] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
-- 
Earth is a beta site.
    
    