[slurm-users] Slurm 21.08.8-2 upgrade

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri May 6 08:38:00 UTC 2022


Hi Juergen,

My upgrade report: We upgraded from 21.08.7 to 21.08.8-1 yesterday for the 
entire cluster, and we didn't have any issues.  I built RPMs from the 
tar-ball and simply did "yum update" on the nodes (one partition at a 
time) while the cluster was running in full production mode.  All slurmd 
daemons are restarted during the yum update, and this happens within 1-2 
minutes per partition.
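
During a rolling upgrade like this it can help to group nodes by the slurmd 
version they report. Here is a minimal Python sketch, assuming the one-line 
output of "scontrol -o show nodes" carries NodeName= and Version= fields. 
The function name and the sample text are made up for illustration and are 
not from our cluster.

```python
import re
from collections import defaultdict


def nodes_by_version(scontrol_output):
    """Group node names by the slurmd version each node reports.

    Assumes one node per line with NodeName=... and Version=... fields,
    as in the one-line output of `scontrol -o show nodes`.
    """
    groups = defaultdict(list)
    for line in scontrol_output.splitlines():
        name = re.search(r"\bNodeName=(\S+)", line)
        if not name:
            continue  # not a node record
        version = re.search(r"\bVersion=(\S+)", line)
        key = version.group(1) if version else "unknown"
        groups[key].append(name.group(1))
    return dict(groups)


# Hypothetical sample input, shortened to the fields that matter here.
sample = """\
NodeName=n1423 Arch=x86_64 Version=21.08.8-2 OS=Linux
NodeName=n1424 Arch=x86_64 Version=21.08.6 OS=Linux
NodeName=n1425 Arch=x86_64 Version=21.08.8-2 OS=Linux
"""

for version, nodes in sorted(nodes_by_version(sample).items()):
    print(version, ",".join(nodes))
```

On a live system the input would come from running scontrol. The sample 
string only stands in for that output.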

Today I upgraded from 21.08.8-1 to 21.08.8-2 for the entire cluster, and 
again we have not seen any issues.

We are also *not* setting CommunicationParameters=block_null_hash until a 
later date when there are no more old versions of slurmstepd running.  We 
did however see RPC errors with "Protocol authentication error" while 
block_null_hash was enabled briefly, see 
https://bugs.schedmd.com/show_bug.cgi?id=14002, and so we turned it off 
again.  It hasn't happened since.
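
To see which nodes are hit by these errors, one can pull the node names out 
of slurmctld.log. Here is a minimal Python sketch, assuming log lines in the 
"agent/is_node_resp: node:<name> RPC:<type> : Protocol authentication error" 
format quoted further down in this thread. The function name is hypothetical 
and the sample lines are illustrative.

```python
import re
from collections import Counter

# Pattern modeled on the slurmctld.log lines quoted in this thread.
AUTH_ERROR = re.compile(
    r"agent/is_node_resp: node:(\S+) RPC:(\S+) : Protocol authentication error"
)


def count_auth_errors(log_text):
    """Count "Protocol authentication error" lines per node name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AUTH_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample in the same format as the quoted log excerpt.
sample_log = """\
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding
[2022-05-05T20:38:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error
"""

for node, hits in sorted(count_auth_errors(sample_log).items()):
    print(node, hits)
```

Comparing the affected nodes against the slurmd version on each node shows 
quickly whether only upgraded nodes are involved.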

Best regards,
Ole

On 5/6/22 01:57, Juergen Salk wrote:
> Hi John,
> 
> this is really bad news. We have stopped our rolling update from Slurm
> 21.08.6 to Slurm 21.08.8-1 today for exactly that reason: State of
> compute nodes already running slurmd 21.08.8-1 suddenly started
> flapping between responding and not responding but all other nodes
> that were still running version 21.08.6 slurmd were not affected.
> 
> For the affected nodes we did not see any obvious reason in slurmd.log
> even with SlurmdDebug set to debug3 but we noticed the following
> in slurmctld.log with SlurmctldDebug=debug and DebugFlags=route
> enabled.
> 
> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error
> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 RPC:REQUEST_PING : Protocol authentication error
> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425 RPC:REQUEST_PING : Protocol authentication error
> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426 RPC:REQUEST_PING : Protocol authentication error
> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 RPC:REQUEST_PING : Protocol authentication error
> [2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding
> 
> So you have seen this as well with 21.08.8-2?
> 
> We didn't have CommunicationParameters=block_null_hash set, btw.
> 
> Actually, after Tim's last announcement, I was hoping that we can start
> over tomorrow morning with 21.08.8-2 to resolve this issue. Therefore,
> I would also be highly interested in what others can say about rolling updates from
> Slurm 21.08.6 to Slurm 21.08.8-2 which, at least temporarily, entails a
> mix of patched and unpatched slurmd versions on the compute nodes.
> 
> If 21.08.8-2 slurmd still does not work together with 21.08.6 slurmd
> we may have to drain the whole cluster for updating Slurm, which
> is something that I'd actually wished to avoid.
> 
> Best regards
> Jürgen
> 
> 
> 
> * Legato, John (NIH/NHLBI) [E] <legatoj at nhlbi.nih.gov> [220505 22:30]:
>> Hello,
>>
>> We are in the process of upgrading from Slurm 21.08.6 to Slurm 21.08.8-2. We’ve upgraded the controller and a few partitions worth of nodes. We notice the nodes are
>> losing contact with the controller but slurmd is still up. We thought that this issue was fixed in -2 based on this bug report:
>>
>> https://bugs.schedmd.com/show_bug.cgi?id=14011
>>
>> However we are still seeing the same behavior. I note that nodes running 21.08.6 are having no issues with communication. I could
>> upgrade the remaining 21.08.6 nodes but hesitate to do that as it seems like it would completely kill the functioning nodes.
>>
>> Is anyone else still seeing this in -2?


