[slurm-users] Slurm 21.08.8-2 upgrade

Paul Edmon pedmon at cfa.harvard.edu
Fri May 6 15:39:24 UTC 2022


We upgraded from 21.08.6 to 21.08.8-1 yesterday morning, but overnight we 
saw the communication issues described by Tim W.  We upgraded to 
21.08.8-2 this morning, and that resolved all the communication 
problems we were having.

-Paul Edmon-

On 5/6/2022 4:38 AM, Ole Holm Nielsen wrote:
> Hi Juergen,
>
> My upgrade report: We upgraded from 21.08.7 to 21.08.8-1 yesterday for 
> the entire cluster, and we didn't have any issues.  I built RPMs from 
> the tar-ball and simply did "yum update" on the nodes (one partition 
> at a time) while the cluster was running in full production mode.  All 
> slurmd daemons get restarted during the yum update, and this happens 
> within 1-2 minutes per partition.
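>
> In case it is useful, that procedure looks roughly as follows (the
> partition name is just a placeholder, and this assumes the new RPMs are
> available to yum on the nodes, e.g. from a local repository):
>
>   # Build RPMs from the release tar-ball on a build host:
>   rpmbuild -ta slurm-21.08.8-2.tar.bz2
>
>   # Then update the nodes one partition at a time, e.g. with ClusterShell:
>   clush -bw "$(sinfo -h -p <partition> -o '%N')" "yum -y update 'slurm*'"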
>
> Today I upgraded from 21.08.8-1 to 21.08.8-2 for the entire cluster, 
> and again we have not seen any issues.
>
> We are also *not* setting CommunicationParameters=block_null_hash until 
> a later date, when there are no more old versions of slurmstepd 
> running.  We did, however, see RPC errors with "Protocol authentication 
> error" while block_null_hash was briefly enabled, see 
> https://bugs.schedmd.com/show_bug.cgi?id=14002, so we turned it 
> off again.  It hasn't happened since.
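>
> In other words, the corresponding slurm.conf line stays commented out
> for now, roughly:
>
>   # Enable only once no pre-21.08.8 slurmstepd processes remain:
>   #CommunicationParameters=block_null_hash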
>
> Best regards,
> Ole
>
> On 5/6/22 01:57, Juergen Salk wrote:
>> Hi John,
>>
>> this is really bad news. We stopped our rolling update from Slurm
>> 21.08.6 to Slurm 21.08.8-1 today for exactly that reason: the state of
>> compute nodes already running slurmd 21.08.8-1 suddenly started
>> flapping between responding and not responding, while all other nodes
>> that were still running the 21.08.6 slurmd were not affected.
>>
>> For the affected nodes we did not see any obvious reason in slurmd.log,
>> even with SlurmdDebug set to debug3, but we noticed the following
>> in slurmctld.log with SlurmctldDebug=debug and DebugFlags=route
>> enabled:
>>
>> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error
>> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 RPC:REQUEST_PING : Protocol authentication error
>> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425 RPC:REQUEST_PING : Protocol authentication error
>> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426 RPC:REQUEST_PING : Protocol authentication error
>> [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 RPC:REQUEST_PING : Protocol authentication error
>> [2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding
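>>
>> For reference, the debug settings mentioned above were set roughly like
>> this in slurm.conf and applied with "scontrol reconfigure":
>>
>>   SlurmdDebug=debug3
>>   SlurmctldDebug=debug
>>   DebugFlags=route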
>>
>> So you've seen this as well with 21.08.8-2?
>>
>> We didn't have CommunicationParameters=block_null_hash set, btw.
>>
>> Actually, after Tim's last announcement, I was hoping that we could start
>> over tomorrow morning with 21.08.8-2 to resolve this issue. Therefore,
>> I would also be highly interested in what others can say about rolling
>> updates from Slurm 21.08.6 to Slurm 21.08.8-2, which, at least
>> temporarily, entail a mix of patched and unpatched slurmd versions on
>> the compute nodes.
>>
>> If the 21.08.8-2 slurmd still does not work together with the 21.08.6
>> slurmd, we may have to drain the whole cluster to update Slurm, which
>> is something I had actually hoped to avoid.
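>>
>> (Draining would be something along these lines, with the node range
>> being just a placeholder for our actual node list:
>>
>>   scontrol update nodename=n[0001-2000] state=drain reason="Slurm upgrade"
>>
>> and then doing the update once all running jobs have finished.)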
>>
>> Best regards
>> Jürgen
>>
>>
>>
>> * Legato, John (NIH/NHLBI) [E] <legatoj at nhlbi.nih.gov> [220505 22:30]:
>>> Hello,
>>>
>>> We are in the process of upgrading from Slurm 21.08.6 to Slurm 
>>> 21.08.8-2. We’ve upgraded the controller and a few partitions' worth 
>>> of nodes. We notice that the nodes are losing contact with the 
>>> controller even though slurmd is still up. We thought this issue was 
>>> fixed in -2 based on this bug report:
>>>
>>> https://bugs.schedmd.com/show_bug.cgi?id=14011
>>>
>>> However, we are still seeing the same behavior. I note that nodes 
>>> running 21.08.6 are having no communication issues. I could 
>>> upgrade the remaining 21.08.6 nodes, but I hesitate to do that as it 
>>> seems like it would completely kill the functioning nodes.
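>>>
>>> (For reference, the nodes that have lost contact can be spotted with,
>>> e.g.:
>>>
>>>   sinfo -R                  # down/drained nodes and the recorded reason
>>>   scontrol show node <nodename> | grep -i -e state -e reason
>>>
>>> where <nodename> is just a placeholder.)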
>>>
>>> Is anyone else still seeing this in -2?
>


