[slurm-users] [EXTERNAL] Re: Slurm 21.08.8-2 upgrade

Legato, John (NIH/NHLBI) [E] legatoj at nhlbi.nih.gov
Fri May 6 07:49:10 UTC 2022


Hi Juergen,

Thanks for chiming in. For what it's worth, we didn't set CommunicationParameters=block_null_hash either, trying to keep
things simple and not introduce any further variables.
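
For reference, enabling it would just be this one line in slurm.conf (shown
only to make explicit what we left unset):

    CommunicationParameters=block_null_hash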

John


On 5/5/22, 7:57 PM, "Juergen Salk" <juergen.salk at uni-ulm.de> wrote:

    Hi John,

    this is really bad news. We stopped our rolling update from Slurm
    21.08.6 to Slurm 21.08.8-1 today for exactly that reason: the state of
    compute nodes already running the 21.08.8-1 slurmd suddenly started
    flapping between responding and not responding, while all nodes
    still running the 21.08.6 slurmd were unaffected.
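
    A quick way to spot the affected nodes, for anyone seeing similar
    symptoms (the trailing '*' in sinfo's compact state column marks
    nodes that slurmctld considers not responding):

        sinfo -N -h -o '%N %t' | grep -F '*'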

    For the affected nodes we did not see any obvious cause in slurmd.log,
    even with SlurmdDebug set to debug3, but we noticed the following in
    slurmctld.log with SlurmctldDebug=debug and DebugFlags=route enabled:

    [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error
    [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 RPC:REQUEST_PING : Protocol authentication error
    [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425 RPC:REQUEST_PING : Protocol authentication error
    [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426 RPC:REQUEST_PING : Protocol authentication error
    [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 RPC:REQUEST_PING : Protocol authentication error
    [2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding
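
    For reference, those debug settings are plain slurm.conf entries like
    the following; the controller's level and flags can also be changed at
    runtime with scontrol setdebug and scontrol setdebugflags:

        SlurmctldDebug=debug
        SlurmdDebug=debug3
        DebugFlags=Route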

    So you have seen this as well with 21.08.8-2?

    We didn't have CommunicationParameters=block_null_hash set, btw.

    Actually, after Tim's last announcement, I was hoping that we could
    start over tomorrow morning with 21.08.8-2 to resolve this issue. I
    would therefore be highly interested in what others can say about
    rolling updates from Slurm 21.08.6 to Slurm 21.08.8-2, which, at least
    temporarily, entail a mix of patched and unpatched slurmd versions on
    the compute nodes.
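
    In case others want to compare, this is a quick way to see the mix of
    slurmd versions across the nodes (assuming your sinfo supports the %v
    format field, which prints the version each node's slurmd reported):

        sinfo -N -h -o '%N %v'                  # per-node slurmd version
        sinfo -N -h -o '%v' | sort | uniq -c    # node count per version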

    If the 21.08.8-2 slurmd still does not work together with the 21.08.6
    slurmd, we may have to drain the whole cluster to update Slurm, which
    is something I had really hoped to avoid.

    Best regards
    Jürgen



    * Legato, John (NIH/NHLBI) [E] <legatoj at nhlbi.nih.gov> [220505 22:30]:
    > Hello,
    >
    > We are in the process of upgrading from Slurm 21.08.6 to Slurm 21.08.8-2. We’ve upgraded the controller and a few partitions' worth of nodes. We notice the upgraded nodes are
    > losing contact with the controller even though slurmd is still up. We thought this issue was fixed in -2 based on this bug report:
    >
    > https://bugs.schedmd.com/show_bug.cgi?id=14011
    >
    > However, we are still seeing the same behavior. I note that nodes running 21.08.6 are having no communication issues. I could
    > upgrade the remaining 21.08.6 nodes but hesitate to do so, as it seems like it would break the nodes that are currently working.
    >
    > Is anyone else still seeing this in -2?
    >
    > Thanks
    >
    > John
    >
    >
    >



