[slurm-users] slurmctld loses connection with slurmd for no reason after upgrading from 20.11.8 to 21.08.8-2

Audet, Martin Martin.Audet at cnrc-nrc.gc.ca
Fri May 20 12:52:46 UTC 2022


Hello Slurm community,

It seems that since we upgraded Slurm from 20.11.8 to 21.08.8-2, and since a user started a special type of job with many short steps (e.g. 1000), each lasting from one to five minutes, slurmctld loses contact after many steps (283 in this case) with the slurmd daemons running on the compute nodes of that job. It wrongly concludes that the nodes were resuming and exceeded the ResumeTimeout of 300 s, which is impossible: the job had been running for almost 24 hours on the same four nodes, so no node ever "resumed". Slurm then decides to mark the nodes as "down" and "power_save", and just afterwards it notices that the four nodes used by the job are "now responding".

But it is too late! The decision has already been made: the job is requeued on another set of nodes and the old nodes are marked "down~".
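
For reference, the "not resumed by ResumeTimeout" message comes from Slurm's node power-saving logic, which is driven by slurm.conf parameters along the lines below (the values and script paths are only illustrative, not our actual settings):

  # Power-saving parameters involved in the suspend/resume logic (illustrative values)
  SuspendTime=600                              # idle seconds before a node is powered down
  ResumeTimeout=300                            # seconds a node has to become available after a resume request
  SuspendProgram=/usr/local/sbin/node_suspend  # site-specific script (hypothetical path)
  ResumeProgram=/usr/local/sbin/node_resume    # site-specific script (hypothetical path)

What is strange is that ResumeTimeout should only expire for a node that Slurm believes is resuming, yet these four nodes had been running the same job continuously.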

Note that on the compute nodes themselves everything appears normal. The only message I see is "credential for job 47647 revoked". The slurmd daemon is running fine according to systemctl and I have seen nothing in /var/log/messages indicating an error (e.g. a communication error) on the head or compute nodes. Communication between slurmctld and slurmd is done over IPoIB interfaces (rather than GbE) and our InfiniBand network seems fine.
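
For completeness, here is roughly the kind of check I did on cn9 and on the head node (a sketch, not an exact transcript):

   # on the compute node (cn9)
   systemctl status slurmd                           # daemon active, no restart
   grep -i error /var/log/messages                   # nothing relevant
   # on the head node
   scontrol show node cn9 | grep -E 'State|Reason'   # node state and recorded reason
   sinfo -R                                          # list of down/drained nodes with reasons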

In the last three days this problem has happened three times on our small 24-node cluster.

Am I the only one with this problem?

In my opinion, this intermittent problem really looks like a race condition or a communication bug.

Note that slurm.conf contains:

  CommunicationParameters=block_null_hash
  MaxStepCount=100000000
  SlurmctldDebug=debug
  SlurmdDebug=debug

Also, the upgrade was done cleanly: all jobs were stopped using a maintenance reservation, the daemons were stopped and disabled, the rpms were built from source, the old ones (20.11.8) were removed and the new ones (21.08.8-2) installed, and finally the daemons were started and enabled.
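
In other words, the sequence was roughly the following (a simplified sketch; the rpms are built once and installed on every node, and package names are abbreviated):

   # jobs drained beforehand via a maintenance reservation (scontrol create reservation ... flags=maint)
   systemctl disable --now slurmd          # slurmctld on the head node
   rpmbuild -ta slurm-21.08.8-2.tar.bz2    # build the new rpms from the source tarball
   rpm -e <old 20.11.8 slurm rpms>         # remove the old packages
   rpm -ivh slurm-21.08.8-2*.rpm           # install the new packages
   systemctl enable --now slurmd           # slurmctld on the head node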

Job 47647 was allocated four nodes: cn[9-12]. Each step is short, uses all of the allocated CPUs, and the steps run one after the other.
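
The job script has roughly the following structure (a sketch only, not the user's actual script; resource values are illustrative):

   #!/bin/bash
   #SBATCH --nodes=4
   #SBATCH --ntasks-per-node=48      # illustrative: every allocated CPU is used by each step
   #SBATCH --time=1-12:00:00         # illustrative walltime
   for i in $(seq 1 1000); do        # ~1000 short steps of one to five minutes each
       srun ./short_step "$i"        # hypothetical step executable, one step at a time
   done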

Thanks in advance,

Martin Audet

Here are the messages I had from slurmctld:


   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn9 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: requeue job JobId=47647 due to failure of node cn9
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Requeuing JobId=47647
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn10 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn11 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn12 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: sched: Updated reservation=limite users=root nodes=cn[1-24] cores=576 licenses=(null) tres=cpu=1152 watts=4294967294 start=2022-06-30T23:59:00 end=2022-07-01T00:00:00 MaxStartDelay=
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn10 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn11 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn12 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn9 now responding
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: _job_complete: JobId=47647 WTERMSIG 15
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: _job_complete: JobId=47647 cancelled by interactive user
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn12
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn10
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn11
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn9


Here are the messages I had from slurmd on cn9 (the other three nodes show similar messages):

   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: launch task StepId=47647.283 request from UID:171971265 GID:171971265 HOST:10.2.1.109 PORT:43544
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Checking credential with 572 bytes of sig data
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: lllp_distribution: JobId=47647 binding: cores, dist 2
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [47647]: mask_cpu, 0x000001000001,0x001000001000,0x000002000002,0x002000002000,0x000004000004,0x004000004000,0x000008000008,0x008000008000,0x000010000010,0x010000010000,0x000020000020,0x020000020000,0x000040000040,0x040000040000,0x000080000080,0x080000080000,0x000100000100,0x100000100000,0x000200000200,0x200000200000,0x000400000400,0x400000400000,0x000800000800,0x800000800000
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Waiting for job 47647's prolog to complete
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Finished wait for job 47647's prolog to complete
   May 19 11:13:08 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  _rpc_terminate_job: uid = 1001 JobId=47647
   May 19 11:13:08 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  credential for job 47647 revoked
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Waiting for job 47647's prolog to complete
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Finished wait for job 47647's prolog to complete
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  completed epilog for jobid 47647
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  JobId=47647: sent epilog complete msg: rc = 0


