[slurm-users] One node does not terminate simple hostname job
Julien Tailleur
julien.tailleur at gmail.com
Tue Jan 23 10:19:02 MST 2018
Dear all,
First of all, I am new to Slurm and to this mailing list; please accept my apologies
if I do not provide all the needed information. I am setting up a small
cluster under Debian. I have Slurm and Munge installed and configured, and
the controller and the daemons run fine on the master node and the compute
nodes, respectively. I have thus reached the "srun -Nx /bin/hostname"
stage, and I have run into a weird problem...
I have 16 DELL servers, FX11-14, FX21-24, FX31-34 and FX41-44.
If I create a partition with every node except FX11, the command
srun -N15 /bin/hostname
runs smoothly, without any delay. When I make a partition that includes
FX11, I see a weird behaviour. If I run
srun -N16 /bin/hostname
I get the correct output:
:~# srun -N16 /bin/hostname
FX41
FX13
FX14
FX12
FX34
FX42
FX22
FX43
FX23
FX44
FX24
FX11
FX31
FX32
FX33
FX21
But if I then run sinfo, the FX11 node is stuck in the "comp" state (is this
"completing"?):
:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
First* up infinite 1 comp FX11
First* up infinite 15 idle FX[12-14,21-24,31-34,41-44]
If I wait long enough, the node becomes available again and I can run the
same command once more. But if I run the command twice in quick succession, I get stuck:
:~# srun -N16 /bin/hostname
FX34
FX42
FX32
FX13
FX21
FX23
FX22
FX41
FX12
FX43
FX11
FX44
FX31
FX33
FX24
FX14
root@kandinsky:~# srun -N16 /bin/hostname
srun: job 802 queued and waiting for resources
[long pause]
srun: error: Lookup failed: Unknown host
srun: job 802 has been allocated resources
FX23
FX42
FX11
FX12
FX33
FX44
FX43
FX22
FX21
FX24
FX13
FX41
FX34
FX31
FX32
FX14
If I run sinfo during the long pause, I get:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
First* up infinite 1 comp FX11
First* up infinite 15 idle FX[12-14,21-24,31-34,41-44]
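Given the "srun: error: Lookup failed: Unknown host" message above, I can also run
checks along the following lines from the controller (and from FX11 itself) if that
would help; these are just the obvious candidates, and I have not pasted their output here:

    # does the controller resolve the node name?
    getent hosts FX11
    # detailed node state as seen by slurmctld
    scontrol show node FX11
    # any jobs still associated with the node
    squeue -w FX11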
I have run slurmd in verbose mode on a node that does not cause any problem
(FX14) and on FX11. The (very long) details are below. In practice, it
seems that FX11 does not manage to terminate the job, and I have no idea
why... I have cut sequences of identical lines and replaced them with [....] so
that the message remains readable.
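(For reference, the slurmd output below was captured by running slurmd in the
foreground with the verbosity turned up, roughly as follows; I am not sure of the
exact number of -v flags I used, but it was enough to get debug3/debug4 messages.)

    # on the node, stop the slurmd service, then run it in the foreground with high verbosity
    slurmd -D -vvvvv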
When I run the two srun commands, up to the point where the second one gets
stuck, this is what I see on FX14:
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6001
slurmd-FX14: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX14: launch task 801.0 request from 0.0 at 192.168.6.1 (port 33479)
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0
expires:0
[...]
slurmd-FX14: debug3: state for jobid 793: ctime:1516379034 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920
revoked:1516379920 expires:1516379920
slurmd-FX14: debug3: destroying job 800 state
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0
expires:0
slurmd-FX14: debug: Checking credential with 364 bytes of sig data
slurmd-FX14: debug: task_p_slurmd_launch_request: 801.0 3
slurmd-FX14: _run_prolog: run job script took usec=4
slurmd-FX14: _run_prolog: prolog with lock for job 801 ran for 0 seconds
slurmd-FX14: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX14: debug3: slurmstepd rank 3 (FX14), parent rank 1 (FX12),
children 0, depth 2, max_depth 2
slurmd-FX14: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX14: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX14: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX14: debug: task_p_slurmd_reserve_resources: 801 3
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6004
slurmd-FX14: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX14: debug: _rpc_signal_tasks: sending signal 995 to step 801.0
flag 0
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6011
slurmd-FX14: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX14: debug: _rpc_terminate_job, uid = 64030
slurmd-FX14: debug: task_p_slurmd_release_resources: 801
slurmd-FX14: debug: credential for job 801 revoked
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug2: container signal 18 to job 801.0
slurmd-FX14: debug: kill jobid=801 failed: No such process
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug2: container signal 15 to job 801.0
slurmd-FX14: debug: kill jobid=801 failed: No such process
slurmd-FX14: debug4: sent SUCCESS
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug4: found jobid = 801, stepid = 0
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0
expires:0
[...]
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 revoked:0
expires:0
slurmd-FX14: debug2: set revoke expiration for jobid 801 to 1516380255 UTS
slurmd-FX14: debug: Waiting for job 801's prolog to complete
slurmd-FX14: debug: Finished wait for job 801's prolog to complete
slurmd-FX14: debug: completed epilog for jobid 801
slurmd-FX14: debug3: slurm_send_only_controller_msg: sent 190
slurmd-FX14: debug: Job 801: sent epilog complete msg: rc = 0
So FX14 seems to reach the epilog correctly. During that time,
FX11 gets stuck:
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6001
slurmd-FX11: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX11: launch task 801.0 request from 0.0 at 192.168.6.1 (port 48270)
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920
revoked:1516379920 expires:1516379920
slurmd-FX11: debug3: destroying job 800 state
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920 revoked:0
expires:0
slurmd-FX11: debug: Checking credential with 364 bytes of sig data
slurmd-FX11: debug: task_p_slurmd_launch_request: 801.0 0
slurmd-FX11: _run_prolog: run job script took usec=4
slurmd-FX11: _run_prolog: prolog with lock for job 801 ran for 0 seconds
slurmd-FX11: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX11: debug3: slurmstepd rank 0 (FX11), parent rank -1 (NONE),
children 15, depth 0, max_depth 2
slurmd-FX11: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX11: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX11: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX11: debug: task_p_slurmd_reserve_resources: 801 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6004
slurmd-FX11: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX11: debug: _rpc_signal_tasks: sending signal 995 to step 801.0
flag 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6011
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX11: debug: _rpc_terminate_job, uid = 64030
slurmd-FX11: debug: task_p_slurmd_release_resources: 801
slurmd-FX11: debug: credential for job 801 revoked
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: container signal 18 to job 801.0
slurmd-FX11: debug: kill jobid=801 failed: No such process
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: container signal 15 to job 801.0
slurmd-FX11: debug: kill jobid=801 failed: No such process
slurmd-FX11: debug4: sent SUCCESS
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
And this sequence continues for quite some time, until at some point
the second job goes through. This is what happens on FX14 (normal again):
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6001
slurmd-FX14: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX14: launch task 802.0 request from 0.0 at 192.168.6.1 (port 65223)
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 57: ctime:1516360694 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 58: ctime:1516360694 revoked:0
expires:0
[...]
slurmd-FX14: debug3: state for jobid 786: ctime:1516378107 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 787: ctime:1516378146 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 788: ctime:1516378437 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 789: ctime:1516378508 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 790: ctime:1516378709 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 791: ctime:1516378780 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 792: ctime:1516378962 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 793: ctime:1516379034 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 revoked:0
expires:0
slurmd-FX14: debug: Checking credential with 364 bytes of sig data
slurmd-FX14: debug: task_p_slurmd_launch_request: 802.0 3
slurmd-FX14: _run_prolog: run job script took usec=3
slurmd-FX14: _run_prolog: prolog with lock for job 802 ran for 0 seconds
slurmd-FX14: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX14: debug3: slurmstepd rank 3 (FX14), parent rank 1 (FX12),
children 0, depth 2, max_depth 2
slurmd-FX14: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX14: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX14: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX14: debug: task_p_slurmd_reserve_resources: 802 3
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6004
slurmd-FX14: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX14: debug: _rpc_signal_tasks: sending signal 995 to step 802.0
flag 0
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6011
slurmd-FX14: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX14: debug: _rpc_terminate_job, uid = 64030
slurmd-FX14: debug: task_p_slurmd_release_resources: 802
slurmd-FX14: debug: credential for job 802 revoked
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug2: container signal 18 to job 802.0
slurmd-FX14: debug: kill jobid=802 failed: No such process
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug2: container signal 15 to job 802.0
slurmd-FX14: debug: kill jobid=802 failed: No such process
slurmd-FX14: debug4: sent SUCCESS
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug4: found jobid = 802, stepid = 0
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 57: ctime:1516360694 revoked:0
expires:0
[...]
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131 revoked:0
expires:0
slurmd-FX14: debug3: state for jobid 802: ctime:1516380202
revoked:1516380202 expires:1516380202
slurmd-FX14: debug3: state for jobid 802: ctime:1516380202 revoked:0
expires:0
slurmd-FX14: debug2: set revoke expiration for jobid 802 to 1516380326 UTS
slurmd-FX14: debug: Waiting for job 802's prolog to complete
slurmd-FX14: debug: Finished wait for job 802's prolog to complete
slurmd-FX14: debug: completed epilog for jobid 802
slurmd-FX14: debug3: slurm_send_only_controller_msg: sent 190
slurmd-FX14: debug: Job 802: sent epilog complete msg: rc = 0
while this is what triggered the transition on FX11:
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6011
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX11: debug: _rpc_terminate_job, uid = 64030
slurmd-FX11: debug: task_p_slurmd_release_resources: 801
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131 revoked:0
expires:0
slurmd-FX11: debug2: set revoke expiration for jobid 801 to 1516380322 UTS
slurmd-FX11: debug: Waiting for job 801's prolog to complete
slurmd-FX11: debug: Finished wait for job 801's prolog to complete
slurmd-FX11: debug: completed epilog for jobid 801
slurmd-FX11: debug3: slurm_send_only_controller_msg: sent 190
slurmd-FX11: debug: Job 801: sent epilog complete msg: rc = 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6001
slurmd-FX11: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX11: launch task 802.0 request from 0.0 at 192.168.6.1 (port 10383)
slurmd-FX11: debug: Checking credential with 364 bytes of sig data
slurmd-FX11: debug: task_p_slurmd_launch_request: 802.0 0
slurmd-FX11: _run_prolog: run job script took usec=4
slurmd-FX11: _run_prolog: prolog with lock for job 802 ran for 0 seconds
slurmd-FX11: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX11: debug3: slurmstepd rank 0 (FX11), parent rank -1 (NONE),
children 15, depth 0, max_depth 2
slurmd-FX11: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX11: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX11: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX11: debug: task_p_slurmd_reserve_resources: 802 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6004
slurmd-FX11: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX11: debug: _rpc_signal_tasks: sending signal 995 to step 802.0
flag 0
slurmd-FX11: debug3: in the service_connection
slurmd-FX11: debug2: got this type of message 6011
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX11: debug: _rpc_terminate_job, uid = 64030
slurmd-FX11: debug: task_p_slurmd_release_resources: 802
slurmd-FX11: debug: credential for job 802 revoked
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: container signal 18 to job 802.0
slurmd-FX11: debug: kill jobid=802 failed: No such process
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: container signal 15 to job 802.0
slurmd-FX11: debug: kill jobid=802 failed: No such process
slurmd-FX11: debug4: sent SUCCESS
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug4: found jobid = 802, stepid = 0
slurmd-FX11: debug2: terminate job step 802.0
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131 revoked:0
expires:0
slurmd-FX11: debug3: state for jobid 802: ctime:1516380202
revoked:1516380202 expires:1516380202
slurmd-FX11: debug3: state for jobid 802: ctime:1516380202 revoked:0
expires:0
slurmd-FX11: debug2: set revoke expiration for jobid 802 to 1516380393 UTS
slurmd-FX11: debug: Waiting for job 802's prolog to complete
slurmd-FX11: debug: Finished wait for job 802's prolog to complete
slurmd-FX11: debug: completed epilog for jobid 802
slurmd-FX11: debug3: slurm_send_only_controller_msg: sent 190
slurmd-FX11: debug: Job 802: sent epilog complete msg: rc = 0
As you can see, job 802 also gets stuck on FX11. Any idea where I could look
to figure out what is going wrong?
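In the meantime the only workaround I have is to wait for FX11 to leave the
completing state on its own; I suppose I could also clear it by hand with something like

    scontrol update NodeName=FX11 State=RESUME

but that obviously would not fix the underlying cause.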
Best,
Julien Tailleur