[slurm-users] slurmctld loses connection with slurmd for no reason after upgrading from 20.11.8 to 21.08.8-2

Audet, Martin Martin.Audet at cnrc-nrc.gc.ca
Fri Jun 3 14:16:05 UTC 2022


Hello Slurm user community,

I would like to share my experience with the updates I did following the security fixes published last month (May 4th), as it may help other users (and hopefully attract the attention of the responsible developers).

As I explained in my previous message, we had been running Slurm 20.11.8 since July 2021 without any problems.

After the security patches were published, we decided at the same time to update to the latest version available at that time, 21.08.8-2. We then ran into problems with jobs composed of a large number of steps (e.g., 283). We tried modifying the configuration, without success. We then decided to go back to an earlier release that is not vulnerable, 20.11.9, using a conservative configuration, and it worked! We then modified the configuration to use the parameter recommended for security, and it continued working.

Here is a summary of what's working and what's not working:

20.11.8 (of course without CommunicationParameters=block_null_hash): working
21.08.8-2 with CommunicationParameters=block_null_hash: intermittent problems
21.08.8-2 without CommunicationParameters=block_null_hash: intermittent problems
20.11.9 without CommunicationParameters=block_null_hash: working
20.11.9 with CommunicationParameters=block_null_hash: working
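
For context, here is a quick way to confirm which version and which CommunicationParameters value is actually in effect for each of these configurations (a sketch only; cn9 is just one of our compute nodes used as an example):

   # On the head node: version and CommunicationParameters as seen by slurmctld
   scontrol show config | grep -E 'SLURM_VERSION|CommunicationParameters'

   # On a compute node (e.g. cn9): version of the running slurmd
   slurmd -V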

So from the symptoms I described in my previous message, the intermittent problems really look like buggy behavior (a race condition or a communication problem). And from the results above, this bug looks like a *regression* from 20.11.[8-9] to 21.08.8-2.

To Slurm developers:

I know we are not paying for maintenance. I would like to see my organisation support a useful and well-made open source project such as Slurm instead of buying yet another Windows-centered commercial product. But with our small 24-node cluster, the support fees would be too high for our operational budget. If I were you, I would still try to investigate this problem, as it is of general interest: it may uncover a real regression bug affecting many users (both with and without a support contract).

Thanks,

Martin Audet

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Audet, Martin
Sent: May 20, 2022 8:53
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] slurmctld loses connection with slurmd for no reason after upgrading from 20.11.8 to 21.08.8-2

Hello Slurm community,

It seems that since we upgraded Slurm from 20.11.8 to 21.08.8-2, and since a user started a special type of job with many short steps (e.g., 1000), each lasting from one to five minutes, slurmctld loses contact, after many steps (283 in this case), with the slurmd daemons running on the compute nodes of this job, wrongly concluding that the nodes have resumed and exceeded the ResumeTimeout of 300 s (which is not possible, since the job had been running for almost 24 hours on the same four nodes; no node "resumed"). Slurm then decides to mark the nodes as "down" and "power_save", and just afterwards it notices that the four nodes used by the job are "now responding".

But it is too late! The decision has been made: the job will soon be rescheduled on another set of nodes, and the old nodes are marked "down~".
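
For reference, the node state can be inspected and cleared afterwards with standard scontrol/sinfo commands (a minimal sketch; cn9 is one of the affected nodes, nothing here is specific to our site):

   # Show the state and the reason slurmctld recorded for the node
   scontrol show node cn9 | grep -E 'State=|Reason='

   # Node states for the whole cluster ("down~" appears in the STATE column)
   sinfo -N -l

   # Return the node to service once it is confirmed healthy
   scontrol update NodeName=cn9 State=RESUME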

Note that on the compute nodes themselves everything appears normal. The only message I see is "credential for job 47647 revoked". The slurmd daemon is running fine according to systemctl, and I have seen nothing in /var/log/messages indicating an error (e.g., a communication error) on the head or compute nodes. Communication between slurmctld and slurmd is done over IPoIB interfaces (rather than GbE) and our InfiniBand network seems fine.

In the last three days this problem has happened three times on our small 24-node cluster.

Am I the only one with this problem?

In my opinion, this intermittent problem really looks like a race condition or a communication bug.

Note that slurm.conf contains:

  CommunicationParameters=block_null_hash
  MaxStepCount=100000000
  SlurmctldDebug=debug
  SlurmdDebug=debug

Also, the update was done cleanly by stopping all jobs using a maintenance reservation. The daemons were stopped and disabled. The RPMs were built from source. The old ones (20.11.8) were removed and the new ones installed (21.08.8-2). Finally the daemons were started and enabled.
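
Roughly, the sequence was the following (a sketch only, not our exact commands; the reservation name, duration and package globs are placeholders):

   # Maintenance reservation so no new jobs start during the upgrade
   scontrol create reservation reservationname=maint flags=maint,ignore_jobs \
       starttime=now duration=04:00:00 users=root nodes=ALL

   # Stop and disable the daemons
   systemctl disable --now slurmd        # on each compute node
   systemctl disable --now slurmctld     # on the head node

   # Build the new RPMs from the release tarball, then swap the packages
   rpmbuild -ta slurm-21.08.8-2.tar.bz2
   rpm -e $(rpm -qa 'slurm*')            # remove the old 20.11.8 packages
   rpm -ivh slurm*-21.08.8-2*.rpm        # install the freshly built ones

   # Start and enable the daemons again
   systemctl enable --now slurmctld      # on the head node
   systemctl enable --now slurmd         # on each compute node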

Job 47647 was allocated four nodes: cn[9-12]. Each step is short, uses all the allocated CPUs, and the steps run one after the other.
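
To give an idea of the workload, here is a minimal sketch of this kind of job (the step count, CPU count and program name are made up for illustration, not the user's actual script):

   #!/bin/bash
   #SBATCH --nodes=4
   #SBATCH --ntasks-per-node=48          # all CPUs of a node (placeholder value)
   #SBATCH --time=36:00:00

   # Many short steps run one after the other; each srun launch is one step
   # (./short_solver and its inputs are placeholders)
   for i in $(seq 1 1000); do
      srun ./short_solver input_${i}.dat
   done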

Thanks in advance,

Martin Audet

Here are the messages I had from slurmctld:


   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn9 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: requeue job JobId=47647 due to failure of node cn9
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Requeuing JobId=47647
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn10 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn11 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn12 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: sched: Updated reservation=limite users=root nodes=cn[1-24] cores=576 licenses=(null) tres=cpu=1152 watts=4294967294 start=2022-06-30T23:59:00 end=2022-07-01T00:00:00 MaxStartDelay=
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn10 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn11 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn12 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn9 now responding
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: _job_complete: JobId=47647 WTERMSIG 15
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: _job_complete: JobId=47647 cancelled by interactive user
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn12
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn10
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn11
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn9


Here are the messages I had from slurmd on cn9 (the other three nodes show similar messages):

   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: launch task StepId=47647.283 request from UID:171971265 GID:171971265 HOST:10.2.1.109 PORT:43544
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Checking credential with 572 bytes of sig data
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: lllp_distribution: JobId=47647 binding: cores, dist 2
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [47647]: mask_cpu, 0x000001000001,0x001000001000,0x000002000002,0x002000002000,0x000004000004,0x004000004000,0x000008000008,0x008000008000,0x000010000010,0x010000010000,0x000020000020,0x020000020000,0x000040000040,0x040000040000,0x000080000080,0x080000080000,0x000100000100,0x100000100000,0x000200000200,0x200000200000,0x000400000400,0x400000400000,0x000800000800,0x800000800000
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Waiting for job 47647's prolog to complete
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Finished wait for job 47647's prolog to complete
   May 19 11:13:08 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  _rpc_terminate_job: uid = 1001 JobId=47647
   May 19 11:13:08 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  credential for job 47647 revoked
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Waiting for job 47647's prolog to complete
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  Finished wait for job 47647's prolog to complete
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  completed epilog for jobid 47647
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  JobId=47647: sent epilog complete msg: rc = 0


