<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Dear all,
<br>
<br>
first of all, I am new to Slurm and to this mailing list; please accept
my apologies if I do not provide all the needed information. I am setting
up a small cluster under Debian. I have Slurm and MUNGE
installed and configured, and the controller and daemons run fine
on the master node and compute nodes, respectively. I have thus
reached the "srun -Nx /bin/hostname" stage, and I have a weird
problem...
<br>
<br>
I have 16 DELL servers, FX11-14, FX21-24, FX31-34 and FX41-44.
<br>
<br>
If I define a partition with every node except FX11, the command
<br>
<br>
srun -N15 /bin/hostname
<br>
<br>
runs smoothly, without any lag. When I include FX11 in the
partition, the behaviour is odd. If I run
<br>
<br>
srun -N16 /bin/hostname
<br>
<br>
I get the correct answer:
<br>
<br>
:~# srun -N16 /bin/hostname
<br>
FX41
<br>
FX13
<br>
FX14
<br>
FX12
<br>
FX34
<br>
FX42
<br>
FX22
<br>
FX43
<br>
FX23
<br>
FX44
<br>
FX24
<br>
FX11
<br>
FX31
<br>
FX32
<br>
FX33
<br>
FX21
<br>
<br>
But if I then run sinfo, the FX11 node is stuck in the "comp" state (is
this "completing"?)
<br>
<br>
:~# sinfo
<br>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
<br>
First* up infinite 1 comp FX11
<br>
First* up infinite 15 idle
FX[12-14,21-24,31-34,41-44]
<br>
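<br>
To watch for this state without reading the sinfo output by eye, here is a
rough Python sketch. It assumes the six-column layout shown above;
completing_nodes is my own hypothetical helper, not part of Slurm, and the
sample text is just the output pasted above.
<br>
<br>
```python
def completing_nodes(sinfo_output):
    """Return the NODELIST entries whose STATE column starts with 'comp'."""
    stuck = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        # Assumed columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
        if len(fields) == 6 and fields[4].startswith("comp"):
            stuck.append(fields[5])
    return stuck


# Sample text copied from the sinfo output above (wrapped line rejoined).
sample = """PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
First* up infinite 1 comp FX11
First* up infinite 15 idle FX[12-14,21-24,31-34,41-44]"""

print(completing_nodes(sample))
```
<br>
On a live system the text would come from running sinfo itself, for
instance through subprocess.run.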
<br>
If I wait long enough, FX11 becomes available again and I can run
the same command again. If I run the command twice in quick succession,
the second run gets stuck:
<br>
<br>
:~# srun -N16 /bin/hostname
<br>
FX34
<br>
FX42
<br>
FX32
<br>
FX13
<br>
FX21
<br>
FX23
<br>
FX22
<br>
FX41
<br>
FX12
<br>
FX43
<br>
FX11
<br>
FX44
<br>
FX31
<br>
FX33
<br>
FX24
<br>
FX14
<br>
root@kandinsky:~# srun -N16 /bin/hostname
<br>
srun: job 802 queued and waiting for resources
<br>
<br>
[long pause]
<br>
<br>
srun: error: Lookup failed: Unknown host
<br>
srun: job 802 has been allocated resources
<br>
FX23
<br>
FX42
<br>
FX11
<br>
FX12
<br>
FX33
<br>
FX44
<br>
FX43
<br>
FX22
<br>
FX21
<br>
FX24
<br>
FX13
<br>
FX41
<br>
FX34
<br>
FX31
<br>
FX32
<br>
FX14
<br>
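<br>
The "Lookup failed: Unknown host" line makes me wonder about name
resolution, although I do not know which name srun failed to look up. A
minimal sketch of a check I could run on each machine, assuming the node
names are exactly FX11 to FX44 and that the system resolver is what matters
here; unresolved_hosts is a hypothetical helper of mine:
<br>
<br>
```python
import socket


def unresolved_hosts(names, resolver=socket.gethostbyname):
    """Return the names that the resolver fails to look up."""
    bad = []
    for name in names:
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            bad.append(name)
    return bad


# Assumed node names: FX11-14, FX21-24, FX31-34 and FX41-44.
nodes = ["FX%d%d" % (chassis, slot)
         for chassis in range(1, 5)
         for slot in range(1, 5)]
```
<br>
Calling unresolved_hosts(nodes) on the master and on FX11 would show whether
some name fails to resolve on one machine but not on the other.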
<br>
If I run sinfo during the long pause, I get:
<br>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
<br>
First* up infinite 1 comp FX11
<br>
First* up infinite 15 idle
FX[12-14,21-24,31-34,41-44]
<br>
<br>
I have run slurmd in verbose mode on a node that does not cause
problems (FX14) and on FX11. The (very long) details are below.
In practice, it seems that FX11 does not manage to terminate the
job, and I have no idea why. I have cut sequences of near-identical
lines and replaced them with [...] so that the message remains readable.
<br>
<br>
When I run the two srun commands, up to the point where the second one
gets stuck, this is what I see on FX14:
<br>
<br>
slurmd-FX14: debug3: in the service_connection
<br>
slurmd-FX14: debug2: got this type of message 6001
<br>
slurmd-FX14: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
<br>
slurmd-FX14: launch task 801.0 request from 0.0@192.168.6.1
(port 33479)
<br>
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518
revoked:0 expires:0
<br>
<br>
[...]
<br>
<br>
slurmd-FX14: debug3: state for jobid 793: ctime:1516379034
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920
revoked:1516379920 expires:1516379920
<br>
slurmd-FX14: debug3: destroying job 800 state
<br>
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920
revoked:0 expires:0
<br>
slurmd-FX14: debug: Checking credential with 364 bytes of sig
data
<br>
slurmd-FX14: debug: task_p_slurmd_launch_request: 801.0 3
<br>
slurmd-FX14: _run_prolog: run job script took usec=4
<br>
slurmd-FX14: _run_prolog: prolog with lock for job 801 ran for 0
seconds
<br>
slurmd-FX14: debug3: _rpc_launch_tasks: call to
_forkexec_slurmstepd
<br>
slurmd-FX14: debug3: slurmstepd rank 3 (FX14), parent rank 1
(FX12), children 0, depth 2, max_depth 2
<br>
slurmd-FX14: debug3: _send_slurmstepd_init: call to getpwuid_r
<br>
slurmd-FX14: debug3: _send_slurmstepd_init: return from getpwuid_r
<br>
slurmd-FX14: debug3: _rpc_launch_tasks: return from
_forkexec_slurmstepd
<br>
slurmd-FX14: debug: task_p_slurmd_reserve_resources: 801 3
<br>
slurmd-FX14: debug3: in the service_connection
<br>
slurmd-FX14: debug2: got this type of message 6004
<br>
slurmd-FX14: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
<br>
slurmd-FX14: debug: _rpc_signal_tasks: sending signal 995 to step
801.0 flag 0
<br>
slurmd-FX14: debug3: in the service_connection
<br>
slurmd-FX14: debug2: got this type of message 6011
<br>
slurmd-FX14: debug2: Processing RPC: REQUEST_TERMINATE_JOB
<br>
slurmd-FX14: debug: _rpc_terminate_job, uid = 64030
<br>
slurmd-FX14: debug: task_p_slurmd_release_resources: 801
<br>
slurmd-FX14: debug: credential for job 801 revoked
<br>
slurmd-FX14: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX14: debug2: container signal 18 to job 801.0
<br>
slurmd-FX14: debug: kill jobid=801 failed: No such process
<br>
slurmd-FX14: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX14: debug2: container signal 15 to job 801.0
<br>
slurmd-FX14: debug: kill jobid=801 failed: No such process
<br>
slurmd-FX14: debug4: sent SUCCESS
<br>
slurmd-FX14: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX14: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX14: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX14: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518
revoked:0 expires:0
<br>
<br>
[...]
<br>
<br>
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
<br>
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:0 expires:0
<br>
slurmd-FX14: debug2: set revoke expiration for jobid 801 to
1516380255 UTS
<br>
slurmd-FX14: debug: Waiting for job 801's prolog to complete
<br>
slurmd-FX14: debug: Finished wait for job 801's prolog to
complete
<br>
slurmd-FX14: debug: completed epilog for jobid 801
<br>
slurmd-FX14: debug3: slurm_send_only_controller_msg: sent 190
<br>
slurmd-FX14: debug: Job 801: sent epilog complete msg: rc = 0
<br>
<br>
<b>So FX14 seems to reach the epilog correctly. Meanwhile,
FX11 gets stuck:
</b><br>
<br>
slurmd-FX11: debug3: in the service_connection
<br>
slurmd-FX11: debug2: got this type of message 6001
<br>
slurmd-FX11: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
<br>
slurmd-FX11: launch task 801.0 request from 0.0@192.168.6.1
(port 48270)
<br>
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920
revoked:1516379920 expires:1516379920
<br>
slurmd-FX11: debug3: destroying job 800 state
<br>
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920
revoked:0 expires:0
<br>
slurmd-FX11: debug: Checking credential with 364 bytes of sig
data
<br>
slurmd-FX11: debug: task_p_slurmd_launch_request: 801.0 0
<br>
slurmd-FX11: _run_prolog: run job script took usec=4
<br>
slurmd-FX11: _run_prolog: prolog with lock for job 801 ran for 0
seconds
<br>
slurmd-FX11: debug3: _rpc_launch_tasks: call to
_forkexec_slurmstepd
<br>
slurmd-FX11: debug3: slurmstepd rank 0 (FX11), parent rank -1
(NONE), children 15, depth 0, max_depth 2
<br>
slurmd-FX11: debug3: _send_slurmstepd_init: call to getpwuid_r
<br>
slurmd-FX11: debug3: _send_slurmstepd_init: return from getpwuid_r
<br>
slurmd-FX11: debug3: _rpc_launch_tasks: return from
_forkexec_slurmstepd
<br>
slurmd-FX11: debug: task_p_slurmd_reserve_resources: 801 0
<br>
slurmd-FX11: debug3: in the service_connection
<br>
slurmd-FX11: debug2: got this type of message 6004
<br>
slurmd-FX11: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
<br>
slurmd-FX11: debug: _rpc_signal_tasks: sending signal 995 to step
801.0 flag 0
<br>
slurmd-FX11: debug3: in the service_connection
<br>
slurmd-FX11: debug2: got this type of message 6011
<br>
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
<br>
slurmd-FX11: debug: _rpc_terminate_job, uid = 64030
<br>
slurmd-FX11: debug: task_p_slurmd_release_resources: 801
<br>
slurmd-FX11: debug: credential for job 801 revoked
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: container signal 18 to job 801.0
<br>
slurmd-FX11: debug: kill jobid=801 failed: No such process
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: container signal 15 to job 801.0
<br>
slurmd-FX11: debug: kill jobid=801 failed: No such process
<br>
slurmd-FX11: debug4: sent SUCCESS
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
<b><br>
</b><b>and this sequence continues for quite some time, until at
some point the second job goes through. This is what happens on
FX14 (normal again):
</b><br>
<br>
slurmd-FX14: debug3: in the service_connection
<br>
slurmd-FX14: debug2: got this type of message 6001
<br>
slurmd-FX14: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
<br>
slurmd-FX14: launch task 802.0 request from 0.0@192.168.6.1
(port 65223)
<br>
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 57: ctime:1516360694
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 58: ctime:1516360694
revoked:0 expires:0
<br>
<br>
[...]
<br>
<br>
slurmd-FX14: debug3: state for jobid 786: ctime:1516378107
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 787: ctime:1516378146
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 788: ctime:1516378437
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 789: ctime:1516378508
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 790: ctime:1516378709
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 791: ctime:1516378780
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 792: ctime:1516378962
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 793: ctime:1516379034
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
<br>
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:0 expires:0
<br>
slurmd-FX14: debug: Checking credential with 364 bytes of sig
data
<br>
slurmd-FX14: debug: task_p_slurmd_launch_request: 802.0 3
<br>
slurmd-FX14: _run_prolog: run job script took usec=3
<br>
slurmd-FX14: _run_prolog: prolog with lock for job 802 ran for 0
seconds
<br>
slurmd-FX14: debug3: _rpc_launch_tasks: call to
_forkexec_slurmstepd
<br>
slurmd-FX14: debug3: slurmstepd rank 3 (FX14), parent rank 1
(FX12), children 0, depth 2, max_depth 2
<br>
slurmd-FX14: debug3: _send_slurmstepd_init: call to getpwuid_r
<br>
slurmd-FX14: debug3: _send_slurmstepd_init: return from getpwuid_r
<br>
slurmd-FX14: debug3: _rpc_launch_tasks: return from
_forkexec_slurmstepd
<br>
slurmd-FX14: debug: task_p_slurmd_reserve_resources: 802 3
<br>
slurmd-FX14: debug3: in the service_connection
<br>
slurmd-FX14: debug2: got this type of message 6004
<br>
slurmd-FX14: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
<br>
slurmd-FX14: debug: _rpc_signal_tasks: sending signal 995 to step
802.0 flag 0
<br>
slurmd-FX14: debug3: in the service_connection
<br>
slurmd-FX14: debug2: got this type of message 6011
<br>
slurmd-FX14: debug2: Processing RPC: REQUEST_TERMINATE_JOB
<br>
slurmd-FX14: debug: _rpc_terminate_job, uid = 64030
<br>
slurmd-FX14: debug: task_p_slurmd_release_resources: 802
<br>
slurmd-FX14: debug: credential for job 802 revoked
<br>
slurmd-FX14: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX14: debug2: container signal 18 to job 802.0
<br>
slurmd-FX14: debug: kill jobid=802 failed: No such process
<br>
slurmd-FX14: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX14: debug2: container signal 15 to job 802.0
<br>
slurmd-FX14: debug: kill jobid=802 failed: No such process
<br>
slurmd-FX14: debug4: sent SUCCESS
<br>
slurmd-FX14: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX14: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX14: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX14: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0
expires:0
<br>
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 57: ctime:1516360694
revoked:0 expires:0
<br>
<br>
[...]
<br>
<br>
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
<br>
slurmd-FX14: debug3: state for jobid 801: ctime:1516380131
revoked:0 expires:0
<br>
slurmd-FX14: debug3: state for jobid 802: ctime:1516380202
revoked:1516380202 expires:1516380202
<br>
slurmd-FX14: debug3: state for jobid 802: ctime:1516380202
revoked:0 expires:0
<br>
slurmd-FX14: debug2: set revoke expiration for jobid 802 to
1516380326 UTS
<br>
slurmd-FX14: debug: Waiting for job 802's prolog to complete
<br>
slurmd-FX14: debug: Finished wait for job 802's prolog to
complete
<br>
slurmd-FX14: debug: completed epilog for jobid 802
<br>
slurmd-FX14: debug3: slurm_send_only_controller_msg: sent 190
<br>
slurmd-FX14: debug: Job 802: sent epilog complete msg: rc = 0
<br>
<br>
<b>while this is what triggered the transition on FX11:
</b><br>
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug3: in the service_connection
<br>
slurmd-FX11: debug2: got this type of message 6011
<br>
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
<br>
slurmd-FX11: debug: _rpc_terminate_job, uid = 64030
<br>
slurmd-FX11: debug: task_p_slurmd_release_resources: 801
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 801, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 801.0
<br>
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
<br>
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131
revoked:0 expires:0
<br>
slurmd-FX11: debug2: set revoke expiration for jobid 801 to
1516380322 UTS
<br>
slurmd-FX11: debug: Waiting for job 801's prolog to complete
<br>
slurmd-FX11: debug: Finished wait for job 801's prolog to
complete
<br>
slurmd-FX11: debug: completed epilog for jobid 801
<br>
slurmd-FX11: debug3: slurm_send_only_controller_msg: sent 190
<br>
slurmd-FX11: debug: Job 801: sent epilog complete msg: rc = 0
<br>
slurmd-FX11: debug3: in the service_connection
<br>
slurmd-FX11: debug2: got this type of message 6001
<br>
slurmd-FX11: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
<br>
slurmd-FX11: launch task 802.0 request from 0.0@192.168.6.1
(port 10383)
<br>
slurmd-FX11: debug: Checking credential with 364 bytes of sig
data
<br>
slurmd-FX11: debug: task_p_slurmd_launch_request: 802.0 0
<br>
slurmd-FX11: _run_prolog: run job script took usec=4
<br>
slurmd-FX11: _run_prolog: prolog with lock for job 802 ran for 0
seconds
<br>
slurmd-FX11: debug3: _rpc_launch_tasks: call to
_forkexec_slurmstepd
<br>
slurmd-FX11: debug3: slurmstepd rank 0 (FX11), parent rank -1
(NONE), children 15, depth 0, max_depth 2
<br>
slurmd-FX11: debug3: _send_slurmstepd_init: call to getpwuid_r
<br>
slurmd-FX11: debug3: _send_slurmstepd_init: return from getpwuid_r
<br>
slurmd-FX11: debug3: _rpc_launch_tasks: return from
_forkexec_slurmstepd
<br>
slurmd-FX11: debug: task_p_slurmd_reserve_resources: 802 0
<br>
slurmd-FX11: debug3: in the service_connection
<br>
slurmd-FX11: debug2: got this type of message 6004
<br>
slurmd-FX11: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
<br>
slurmd-FX11: debug: _rpc_signal_tasks: sending signal 995 to step
802.0 flag 0
<br>
slurmd-FX11: debug3: in the service_connection
<br>
slurmd-FX11: debug2: got this type of message 6011
<br>
slurmd-FX11: debug2: Processing RPC: REQUEST_TERMINATE_JOB
<br>
slurmd-FX11: debug: _rpc_terminate_job, uid = 64030
<br>
slurmd-FX11: debug: task_p_slurmd_release_resources: 802
<br>
slurmd-FX11: debug: credential for job 802 revoked
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: container signal 18 to job 802.0
<br>
slurmd-FX11: debug: kill jobid=802 failed: No such process
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: container signal 15 to job 802.0
<br>
slurmd-FX11: debug: kill jobid=802 failed: No such process
<br>
slurmd-FX11: debug4: sent SUCCESS
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug4: found jobid = 802, stepid = 0
<br>
slurmd-FX11: debug2: terminate job step 802.0
<br>
slurmd-FX11: debug3: state for jobid 786: ctime:1516378107
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 788: ctime:1516378437
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 789: ctime:1516378508
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 790: ctime:1516378709
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 791: ctime:1516378780
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 792: ctime:1516378962
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 793: ctime:1516379034
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 794: ctime:1516379184
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 796: ctime:1516379297
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 797: ctime:1516379369
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 798: ctime:1516379701
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 799: ctime:1516379772
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 800: ctime:1516379920
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131
revoked:1516380131 expires:1516380131
<br>
slurmd-FX11: debug3: state for jobid 801: ctime:1516380131
revoked:0 expires:0
<br>
slurmd-FX11: debug3: state for jobid 802: ctime:1516380202
revoked:1516380202 expires:1516380202
<br>
slurmd-FX11: debug3: state for jobid 802: ctime:1516380202
revoked:0 expires:0
<br>
slurmd-FX11: debug2: set revoke expiration for jobid 802 to
1516380393 UTS
<br>
slurmd-FX11: debug: Waiting for job 802's prolog to complete
<br>
slurmd-FX11: debug: Finished wait for job 802's prolog to
complete
<br>
slurmd-FX11: debug: completed epilog for jobid 802
<br>
slurmd-FX11: debug3: slurm_send_only_controller_msg: sent 190
<br>
slurmd-FX11: debug: Job 802: sent epilog complete msg: rc = 0
<br>
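<br>
To compare the nodes more easily than by scrolling through these logs, a
small Python sketch that counts the "terminate job step" retries per node
and job. It assumes the line format shown above; terminate_retries is a
hypothetical helper of mine, and the sample lines are copied from the FX11
log.
<br>
<br>
```python
import re
from collections import Counter

# Matches lines such as "slurmd-FX11: debug2: terminate job step 801.0".
TERMINATE = re.compile(r"slurmd-(\S+): debug2: terminate job step (\d+)\.(\d+)")


def terminate_retries(log_text):
    """Count 'terminate job step' lines per (node, job id)."""
    counts = Counter()
    for match in TERMINATE.finditer(log_text):
        node, job, _step = match.groups()
        counts[(node, int(job))] += 1
    return counts


sample_log = """slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX11: debug4: found jobid = 801, stepid = 0
slurmd-FX11: debug2: terminate job step 801.0
slurmd-FX14: debug: completed epilog for jobid 801"""

print(terminate_retries(sample_log))
```
<br>
In the excerpts above, FX14 never logs this line, while FX11 repeats it
many times for both jobs.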
<br>
As you can see, job 802 also gets stuck. Any idea where I could
look to find out what is going wrong?
<br>
<br>
Best,
<br>
<br>
Julien Tailleur
<br>
</p>
</body>
</html>