[slurm-users] One node does not terminate simple hostname job

Julien Tailleur julien.tailleur at gmail.com
Tue Jan 23 10:19:02 MST 2018


Dear all,

first of all, I am new to Slurm and this mailing list; please accept my 
apologies if I do not provide all the needed information. I am setting 
up a small cluster under Debian. I have Slurm and Munge installed and 
configured, and the controller and daemons run fine on the master node 
and compute nodes, respectively. I have thus reached the 
"srun -Nx /bin/hostname" stage, and I have run into a strange problem.

I have 16 DELL servers, FX11-14, FX21-24, FX31-34 and FX41-44.
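For reference, the partition is defined in slurm.conf roughly like this 
(a sketch with placeholder CPUs/RealMemory values, not my exact file):

```
# Illustrative only: hardware figures below are placeholders
NodeName=FX[11-14,21-24,31-34,41-44] CPUs=8 RealMemory=16000 State=UNKNOWN
PartitionName=First Nodes=FX[11-14,21-24,31-34,41-44] Default=YES MaxTime=INFINITE State=UP
```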

If I create a partition with every node except FX11, the command

srun -N15 /bin/hostname

runs smoothly, without any lag. But when I make a partition that 
includes FX11, I see weird behaviour. If I run

srun -N16 /bin/hostname

I get the correct answer:

:~# srun -N16 /bin/hostname
FX41
FX13
FX14
FX12
FX34
FX42
FX22
FX43
FX23
FX44
FX24
FX11
FX31
FX32
FX33
FX21

But if I then run sinfo, the FX11 node is stuck in the "comp" state (is 
this "completing"?):

:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
First*       up   infinite      1   comp FX11
First*       up   infinite     15   idle FX[12-14,21-24,31-34,41-44]

If I wait long enough, FX11 becomes available again and I can rerun the 
same command. But if I run the command twice in quick succession, the 
second run gets stuck:

:~# srun -N16 /bin/hostname
FX34
FX42
FX32
FX13
FX21
FX23
FX22
FX41
FX12
FX43
FX11
FX44
FX31
FX33
FX24
FX14
root@kandinsky:~# srun -N16 /bin/hostname
srun: job 802 queued and waiting for resources

[long pause]

srun: error: Lookup failed: Unknown host
srun: job 802 has been allocated resources
FX23
FX42
FX11
FX12
FX33
FX44
FX43
FX22
FX21
FX24
FX13
FX41
FX34
FX31
FX32
FX14

If I run sinfo during the long pause, I get:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
First*       up   infinite      1   comp FX11
First*       up   infinite     15   idle FX[12-14,21-24,31-34,41-44]
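Because of the "Lookup failed: Unknown host" message above, I also 
sanity-checked that the node names resolve (a sketch; I list only four 
of the sixteen names here):

```shell
# Every node name in the partition should resolve, both on the
# submit host and on the compute nodes themselves.
for n in FX11 FX12 FX13 FX14; do
    getent hosts "$n" || echo "no entry for $n"
done
hostname -s   # on each node: should match its NodeName in slurm.conf
```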

I have run slurmd in verbose mode on a node that does not cause 
problems (FX14) and on FX11. The (very long) details are below. In 
practice, it seems that FX11 does not manage to terminate the job, and 
I have no idea why. I have cut sequences of identical lines, replacing 
them with [...] so that the message remains readable.

When I run the two srun commands (up to the point where the second one 
gets stuck), this is what I see on FX14:

slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6001
slurmd-FX14: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX14: launch task 801.0 request from 0.0 at 192.168.6.1 (port 33479)
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0 
expires:0

[...]

slurmd-FX14: debug3: state for jobid 793: ctime:1516379034 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 
revoked:1516379920 expires:1516379920
slurmd-FX14: debug3: destroying job 800 state
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0 
expires:0
slurmd-FX14: debug:  Checking credential with 364 bytes of sig data
slurmd-FX14: debug:  task_p_slurmd_launch_request: 801.0 3
slurmd-FX14: _run_prolog: run job script took usec=4
slurmd-FX14: _run_prolog: prolog with lock for job 801 ran for 0 seconds
slurmd-FX14: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX14: debug3: slurmstepd rank 3 (FX14), parent rank 1 (FX12), 
children 0, depth 2, max_depth 2
slurmd-FX14: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX14: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX14: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX14: debug:  task_p_slurmd_reserve_resources: 801 3
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6004
slurmd-FX14: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX14: debug:  _rpc_signal_tasks: sending signal 995 to step 801.0 
flag 0
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6011
slurmd-FX14: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX14: debug:  _rpc_terminate_job, uid = 64030
slurmd-FX14: debug:  task_p_slurmd_release_resources: 801
slurmd-FX14: debug:  credential for job 801 revoked
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug2: container signal 18 to job 801.0
slurmd-FX14: debug:  kill jobid=801 failed: No such process
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug2: container signal 15 to job 801.0
slurmd-FX14: debug:  kill jobid=801 failed: No such process
slurmd-FX14: debug4: sent SUCCESS
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0 
expires:0

[...]

slurmd-FX14: debug3: state for jobid 794: ctime:1516379184 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 
revoked:1516380131 expires:1516380131
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 revoked:0 
expires:0
slurmd-FX14: debug2: set revoke expiration for jobid 801 to 1516380255 UTS
slurmd-FX14: debug:  Waiting for job 801's prolog to complete
slurmd-FX14: debug:  Finished wait for job 801's prolog to complete
slurmd-FX14: debug:  completed epilog for jobid 801
slurmd-FX14: debug3: slurm_send_only_controller_msg: sent 190
slurmd-FX14: debug:  Job 801: sent epilog complete msg: rc = 0

So FX14 seems to reach the epilog correctly. During that time, FX11 
gets stuck:

slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6001
slurmd-FX11: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX11: launch task 801.0 request from 0.0 at 192.168.6.1 (port 48270)
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920 
revoked:1516379920 expires:1516379920
slurmd-FX11: debug3: destroying job 800 state
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920 revoked:0 
expires:0
slurmd-FX11: debug:  Checking credential with 364 bytes of sig data
slurmd-FX11: debug:  task_p_slurmd_launch_request: 801.0 0
slurmd-FX11: _run_prolog: run job script took usec=4
slurmd-FX11: _run_prolog: prolog with lock for job 801 ran for 0 seconds
slurmd-FX11: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX11: debug3: slurmstepd rank 0 (FX11), parent rank -1 (NONE), 
children 15, depth 0, max_depth 2
slurmd-FX11: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX11: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX11: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX11: debug:  task_p_slurmd_reserve_resources: 801 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6004
slurmd-FX11: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX11: debug:  _rpc_signal_tasks: sending signal 995 to step 801.0 
flag 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6011
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX11: debug:  _rpc_terminate_job, uid = 64030
slurmd-FX11: debug:  task_p_slurmd_release_resources: 801
slurmd-FX11: debug:  credential for job 801 revoked
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: container signal 18 to job 801.0
slurmd-FX11: debug:  kill jobid=801 failed: No such process
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: container signal 15 to job 801.0
slurmd-FX11: debug:  kill jobid=801 failed: No such process
slurmd-FX11: debug4: sent SUCCESS
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
This sequence continues for quite some time, until at some point the 
second job goes through. This is what happens on FX14 (normal again):

slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6001
slurmd-FX14: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX14: launch task 802.0 request from 0.0 at 192.168.6.1 (port 65223)
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 57: ctime:1516360694 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 58: ctime:1516360694 revoked:0 
expires:0

[...]

slurmd-FX14: debug3: state for jobid 786: ctime:1516378107 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 787: ctime:1516378146 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 788: ctime:1516378437 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 789: ctime:1516378508 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 790: ctime:1516378709 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 791: ctime:1516378780 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 792: ctime:1516378962 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 793: ctime:1516379034 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 
revoked:1516380131 expires:1516380131
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 revoked:0 
expires:0
slurmd-FX14: debug:  Checking credential with 364 bytes of sig data
slurmd-FX14: debug:  task_p_slurmd_launch_request: 802.0 3
slurmd-FX14: _run_prolog: run job script took usec=3
slurmd-FX14: _run_prolog: prolog with lock for job 802 ran for 0 seconds
slurmd-FX14: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX14: debug3: slurmstepd rank 3 (FX14), parent rank 1 (FX12), 
children 0, depth 2, max_depth 2
slurmd-FX14: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX14: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX14: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX14: debug:  task_p_slurmd_reserve_resources: 802 3
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6004
slurmd-FX14: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX14: debug:  _rpc_signal_tasks: sending signal 995 to step 802.0 
flag 0
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6011
slurmd-FX14: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX14: debug:  _rpc_terminate_job, uid = 64030
slurmd-FX14: debug:  task_p_slurmd_release_resources: 802
slurmd-FX14: debug:  credential for job 802 revoked
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug2: container signal 18 to job 802.0
slurmd-FX14: debug:  kill jobid=802 failed: No such process
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug2: container signal 15 to job 802.0
slurmd-FX14: debug:  kill jobid=802 failed: No such process
slurmd-FX14: debug4: sent SUCCESS
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 57: ctime:1516360694 revoked:0 
expires:0

[...]

slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 
revoked:1516380131 expires:1516380131
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 802: ctime:1516380202 
revoked:1516380202 expires:1516380202
slurmd-FX14: debug3: state for jobid 802: ctime:1516380202 revoked:0 
expires:0
slurmd-FX14: debug2: set revoke expiration for jobid 802 to 1516380326 UTS
slurmd-FX14: debug:  Waiting for job 802's prolog to complete
slurmd-FX14: debug:  Finished wait for job 802's prolog to complete
slurmd-FX14: debug:  completed epilog for jobid 802
slurmd-FX14: debug3: slurm_send_only_controller_msg: sent 190
slurmd-FX14: debug:  Job 802: sent epilog complete msg: rc = 0

Meanwhile, this is what triggered the transition on FX11:

slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6011
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX11: debug:  _rpc_terminate_job, uid = 64030
slurmd-FX11: debug:  task_p_slurmd_release_resources: 801
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131 
revoked:1516380131 expires:1516380131
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131 revoked:0 
expires:0
slurmd-FX11: debug2: set revoke expiration for jobid 801 to 1516380322 UTS
slurmd-FX11: debug:  Waiting for job 801's prolog to complete
slurmd-FX11: debug:  Finished wait for job 801's prolog to complete
slurmd-FX11: debug:  completed epilog for jobid 801
slurmd-FX11: debug3: slurm_send_only_controller_msg: sent 190
slurmd-FX11: debug:  Job 801: sent epilog complete msg: rc = 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6001
slurmd-FX11: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX11: launch task 802.0 request from 0.0 at 192.168.6.1 (port 10383)
slurmd-FX11: debug:  Checking credential with 364 bytes of sig data
slurmd-FX11: debug:  task_p_slurmd_launch_request: 802.0 0
slurmd-FX11: _run_prolog: run job script took usec=4
slurmd-FX11: _run_prolog: prolog with lock for job 802 ran for 0 seconds
slurmd-FX11: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX11: debug3: slurmstepd rank 0 (FX11), parent rank -1 (NONE), 
children 15, depth 0, max_depth 2
slurmd-FX11: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX11: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX11: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX11: debug:  task_p_slurmd_reserve_resources: 802 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6004
slurmd-FX11: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX11: debug:  _rpc_signal_tasks: sending signal 995 to step 802.0 
flag 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6011
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX11: debug:  _rpc_terminate_job, uid = 64030
slurmd-FX11: debug:  task_p_slurmd_release_resources: 802
slurmd-FX11: debug:  credential for job 802 revoked
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: container signal 18 to job 802.0
slurmd-FX11: debug:  kill jobid=802 failed: No such process
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: container signal 15 to job 802.0
slurmd-FX11: debug:  kill jobid=802 failed: No such process
slurmd-FX11: debug4: sent SUCCESS
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131 
revoked:1516380131 expires:1516380131
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131 revoked:0 
expires:0
slurmd-FX11: debug3: state for jobid 802: ctime:1516380202 
revoked:1516380202 expires:1516380202
slurmd-FX11: debug3: state for jobid 802: ctime:1516380202 revoked:0 
expires:0
slurmd-FX11: debug2: set revoke expiration for jobid 802 to 1516380393 UTS
slurmd-FX11: debug:  Waiting for job 802's prolog to complete
slurmd-FX11: debug:  Finished wait for job 802's prolog to complete
slurmd-FX11: debug:  completed epilog for jobid 802
slurmd-FX11: debug3: slurm_send_only_controller_msg: sent 190
slurmd-FX11: debug:  Job 802: sent epilog complete msg: rc = 0

As you can see, job 802 also gets stuck. Any idea where I could look to 
find out what is going wrong?

Best,

Julien Tailleur
