[slurm-users] [EXTERNAL] Re: Slurm 21.08.8-2 upgrade

Fri May 6 15:55:26 UTC 2022

I am going to upgrade all nodes in one shot on Monday, concluding that rolling upgrades are broken with these changes.

John

On 5/6/22, 11:51 AM, "slurm-users on behalf of Paul Edmon" <slurm-users-bounces at lists.schedmd.com on behalf of pedmon at cfa.harvard.edu> wrote:

    We upgraded from 21.08.6 to 21.08.8-1 yesterday morning but overnight we
    saw the communications issues described by Tim W.  We upgraded to
    21.08.8-2 this morning and that did the trick to resolve all the
    communications problems we were having.

    -Paul Edmon-

    On 5/6/2022 4:38 AM, Ole Holm Nielsen wrote:
    > Hi Juergen,
    >
    > My upgrade report: We upgraded from 21.08.7 to 21.08.8-1 yesterday for
    > the entire cluster, and we didn't have any issues.  I built RPMs from
    > the tar-ball and simply did "yum update" on the nodes (one partition
    > at a time) while the cluster was running in full production mode.  All
    > slurmd get restarted during the yum update, and this happens within
    > 1-2 minutes per partition.
    >
    > Today I upgraded from 21.08.1-1 to 21.08.8-2 for the entire cluster,
    > and again we have not seen any issues.
    >
    > We are also *not* setting CommunicationParameters=block_null_hash until
    > a later date when there are no more old versions of slurmstepd
    > running.  We did however see RPC errors with "Protocol authentication
    > error" while block_null_hash was enabled briefly, see
    > https://bugs.schedmd.com/show_bug.cgi?id=14002, and so we turned it
    > off again.  It hasn't happened since.
    >
    > Best regards,
    > Ole
    >
    > On 5/6/22 01:57, Juergen Salk wrote:
    >> Hi John,
    >>
    >> this is really bad news. We have stopped our rolling update from Slurm
    >> 21.08.6 to Slurm 21.08.8-1 today for exactly that reason: State of
    >> compute nodes already running slurmd 21.08.8-1 suddenly started
    >> flapping between responding and not responding but all other nodes
    >> that were still running version 21.08.6 slurmd were not affected.
    >>
    >> For the affected nodes we did not see any obvious reason in slurmd.log
    >> even with SlurmdDebug set to debug3 but we noticed the following
    >> in slurmctld.log with SlurmctldDebug=debug and DebugFlags=route
    >> enabled.
    >>
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425 RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426 RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding
    >>
    >> So you have seen this as well with 21.08.8-2?
    >>
    >> We didn't have CommunicationParameters=block_null_hash set, btw.
    >>
    >> Actually, after Tim's last announcement, I was hoping that we could
    >> start over tomorrow morning with 21.08.8-2 to resolve this issue.
    >> Therefore, I would also be highly interested in what others can say
    >> about rolling updates from Slurm 21.08.6 to Slurm 21.08.8-2 which, at
    >> least temporarily, entail a mix of patched and unpatched slurmd
    >> versions on the compute nodes.
    >>
    >> If 21.08.8-2 slurmd still does not work together with 21.08.6 slurmd
    >> we may have to drain the whole cluster for updating Slurm, which
    >> is something that I'd actually wished to avoid.
    >>
    >> Best regards
    >> Jürgen
    >>
    >>
    >>
    >> * Legato, John (NIH/NHLBI) [E] <legatoj at nhlbi.nih.gov> [220505 22:30]:
    >>> Hello,
    >>>
    >>> We are in the process of upgrading from Slurm 21.08.6 to Slurm
    >>> 21.08.8-2. We’ve upgraded the controller and a few partitions worth
    >>> of nodes. We notice the nodes are
    >>> losing contact with the controller but slurmd is still up. We
    >>> thought that this issue was fixed in -2 based on this bug report:
    >>>
    >>> https://bugs.schedmd.com/show_bug.cgi?id=14011
    >>>
    >>> However we are still seeing the same behavior. I note that nodes
    >>> running 21.08.6 are having no issues with communication. I could
    >>> upgrade the remaining 21.08.6 nodes but hesitate to do that as it
    >>> seems like it would completely kill the functioning nodes.
    >>>
    >>> Is anyone else still seeing this in -2?
    >
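
For anyone wanting to check which nodes are affected during a mixed-version window, the slurmctld.log excerpt quoted above can be scanned with something like the following. This is only a rough sketch: it assumes each entry sits on a single line in the actual log file and follows the shape seen in the excerpt (timestamp, agent/is_node_resp, node:NAME, RPC:NAME, error text). The function name auth_error_nodes and the regular expression are illustrative and not part of Slurm.

```python
import re
from collections import Counter

# Assumed entry shape, taken from the excerpt quoted in this thread:
# [timestamp] agent/is_node_resp: node:<name> RPC:<rpc> : <error text>
ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+agent/is_node_resp:\s+node:(?P<node>\S+)\s+"
    r"RPC:(?P<rpc>\S+)\s+:\s+(?P<err>.+)$"
)


def auth_error_nodes(lines, needle="Protocol authentication error"):
    """Count, per node, the log entries whose error text contains `needle`."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line.strip())
        if match and needle in match.group("err"):
            counts[match.group("node")] += 1
    return counts


if __name__ == "__main__":
    # Two entries copied from the thread, plus one line that should be ignored.
    sample = [
        "[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 "
        "RPC:REQUEST_PING : Protocol authentication error",
        "[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 "
        "RPC:REQUEST_PING : Protocol authentication error",
        "[2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding",
    ]
    for node, hits in sorted(auth_error_nodes(sample).items()):
        print(node, hits)
```

To run it against a real log, pass an open file object instead of the sample list; the log path depends on SlurmctldLogFile in your slurm.conf. Comparing the resulting node list with the slurmd version installed on each node should show whether only the already-upgraded nodes are hit, as reported above.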
