[slurm-users] [EXTERNAL] Re: Slurm 21.08.8-2 upgrade

Legato, John (NIH/NHLBI) [E] legatoj at nhlbi.nih.gov
Fri May 6 15:55:26 UTC 2022


Since rolling upgrades appear to be broken by these changes, I am going to upgrade all nodes in one shot on Monday.
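
For reference, the rough plan looks something like the sketch below. The
controller is already on 21.08.8-2, and the node range, package glob and
the use of pdsh are only placeholders for whatever fits the site:

    # Drain all compute nodes so nothing new starts during the upgrade.
    scontrol update NodeName=n[0001-2000] State=DRAIN Reason="slurm 21.08.8-2 upgrade"

    # Upgrade every compute node in one pass and restart slurmd.
    pdsh -w n[0001-2000] "yum -y update 'slurm*' && systemctl restart slurmd"

    # Return the nodes to service.
    scontrol update NodeName=n[0001-2000] State=RESUME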

John


On 5/6/22, 11:51 AM, "slurm-users on behalf of Paul Edmon" <slurm-users-bounces at lists.schedmd.com on behalf of pedmon at cfa.harvard.edu> wrote:

    We upgraded from 21.08.6 to 21.08.8-1 yesterday morning, but overnight we
    saw the communication issues described by Tim W.  We upgraded to
    21.08.8-2 this morning, and that resolved all the communication
    problems we were having.

    -Paul Edmon-

    On 5/6/2022 4:38 AM, Ole Holm Nielsen wrote:
    > Hi Juergen,
    >
    > My upgrade report: We upgraded from 21.08.7 to 21.08.8-1 yesterday for
    > the entire cluster, and we didn't have any issues.  I built RPMs from
    > the tar-ball and simply did "yum update" on the nodes (one partition
    > at a time) while the cluster was running in full production mode.  All
    > slurmd daemons get restarted during the yum update, and this takes
    > about 1-2 minutes per partition.
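    >
    > For illustration, the partition-at-a-time step is essentially just the
    > loop below.  The partition names are placeholders, pdsh is simply the
    > parallel shell we happen to use, and slurmd gets restarted as part of
    > the yum update itself:
    >
    >   for part in part1 part2 part3; do
    >       nodes=$(sinfo -h -p "$part" -o "%N")
    >       pdsh -w "$nodes" "yum -y update 'slurm*'"
    >       sleep 120   # give slurmd a moment to re-register with slurmctld
    >   done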
    >
    > Today I upgraded from 21.08.8-1 to 21.08.8-2 for the entire cluster,
    > and again we have not seen any issues.
    >
    > We are also *not* setting CommunicationParameters=block_null_hash until
    > a later date, when there are no more old versions of slurmstepd
    > running.  We did, however, see RPC errors with "Protocol authentication
    > error" while block_null_hash was briefly enabled (see
    > https://bugs.schedmd.com/show_bug.cgi?id=14002), so we turned it
    > off again.  It hasn't happened since.
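    >
    > For reference, enabling it later only requires the single slurm.conf
    > line
    >
    >   CommunicationParameters=block_null_hash
    >
    > plus a reconfigure/restart of the daemons; whether it is currently
    > active can be checked with
    > "scontrol show config | grep -i CommunicationParameters".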
    >
    > Best regards,
    > Ole
    >
    > On 5/6/22 01:57, Juergen Salk wrote:
    >> Hi John,
    >>
    >> this is really bad news. We stopped our rolling update from Slurm
    >> 21.08.6 to Slurm 21.08.8-1 today for exactly that reason: the state of
    >> compute nodes already running slurmd 21.08.8-1 suddenly started
    >> flapping between responding and not responding, while all other nodes
    >> still running slurmd 21.08.6 were not affected.
    >>
    >> For the affected nodes we did not see any obvious cause in slurmd.log,
    >> even with SlurmdDebug set to debug3, but we noticed the following
    >> in slurmctld.log with SlurmctldDebug=debug and DebugFlags=route
    >> enabled.
    >>
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423
    >> RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424
    >> RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425
    >> RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426
    >> RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811
    >> RPC:REQUEST_PING : Protocol authentication error
    >> [2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding
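    >>
    >> (For completeness, the above was captured with nothing more than the
    >> slurm.conf settings
    >>
    >>   SlurmctldDebug=debug
    >>   SlurmdDebug=debug3
    >>   DebugFlags=route
    >>
    >> and the controller-side verbosity can also be raised temporarily with
    >> "scontrol setdebug debug" and "scontrol setdebugflags +Route" instead
    >> of editing slurm.conf.)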
    >>
    >> So you have seen this as well with 21.08.8-2?
    >>
    >> We didn't have CommunicationParameters=block_null_hash set, btw.
    >>
    >> Actually, after Tim's last announcement, I was hoping that we could
    >> start over tomorrow morning with 21.08.8-2 to resolve this issue.
    >> Therefore, I would also be highly interested in what others can say
    >> about rolling updates from Slurm 21.08.6 to Slurm 21.08.8-2, which, at
    >> least temporarily, entail a mix of patched and unpatched slurmd
    >> versions on the compute nodes.
    >>
    >> If 21.08.8-2 slurmd still does not work together with 21.08.6 slurmd,
    >> we may have to drain the whole cluster to update Slurm, which
    >> is something I had actually hoped to avoid.
    >>
    >> Best regards
    >> Jürgen
    >>
    >>
    >>
    >> * Legato, John (NIH/NHLBI) [E] <legatoj at nhlbi.nih.gov> [220505 22:30]:
    >>> Hello,
    >>>
    >>> We are in the process of upgrading from Slurm 21.08.6 to Slurm
    >>> 21.08.8-2. We have upgraded the controller and a few partitions' worth
    >>> of nodes. We notice that the nodes are losing contact with the
    >>> controller even though slurmd is still up. We thought this issue was
    >>> fixed in -2 based on this bug report:
    >>>
    >>> https://bugs.schedmd.com/show_bug.cgi?id=14011
    >>>
    >>> However, we are still seeing the same behavior. I note that nodes
    >>> running 21.08.6 are having no communication issues. I could upgrade
    >>> the remaining 21.08.6 nodes, but I hesitate to do that since it seems
    >>> like it would break the currently functioning nodes as well.
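    >>>
    >>> (While the version mix exists, "scontrol show node <nodename>" is a
    >>> quick way to correlate the two, since it reports both the slurmd
    >>> version a node registered with and its current state, e.g.
    >>>
    >>>   scontrol show node n0001 | grep -E "Version|State"
    >>>
    >>> with the node name being a placeholder; the nodes flagged as not
    >>> responding should all turn out to be the ones already on 21.08.8-2.)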
    >>>
    >>> Is anyone else still seeing this in -2?
    >
