[slurm-users] CommunicationParameters=block_null_hash issue in 21.08.8

Thu May 5 13:43:14 UTC 2022

Hi Marcus,

On 5/5/22 14:45, Marcus Boden wrote:
> we had similar issues on our systems. As I understand from the bug you 
> linked, we just need to wait until all the old jobs are finished (and the 
> old slurmstepd are gone). So a full drain should not be necessary?

Yes, I believe that sounds right.

I've been thinking about how to determine the timestamp of the oldest job 
running on the cluster, and then make sure this is after the time that all 
slurmd daemons were upgraded to 21.08.8.

This command will tell you the oldest running jobs:

$ squeue -t running -O StartTime | sort | head

You can add more -O options to get JobIDs etc., as long as you sort on the 
StartTime column (Slurm ISO 8601 timestamps[1] can simply be sorted in 
lexicographical order).
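
As a minimal sketch of that comparison, the helper below filters squeue 
output for jobs that started before a given cutoff time.  The function name 
jobs_started_before and the cutoff value in the usage line are only 
illustrative assumptions; you would substitute the time at which your last 
slurmd was upgraded.  It relies on nothing more than the string ordering of 
ISO 8601 timestamps:

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): read lines of the form
# "StartTime JobID" on stdin and print those whose StartTime is earlier
# than the cutoff timestamp passed as $1, oldest first.
jobs_started_before() {
    cutoff="$1"
    # Appending "" forces awk to compare as strings, which is sufficient
    # because ISO 8601 timestamps sort lexicographically.
    awk -v cutoff="$cutoff" '($1 "") < (cutoff "") { print }' | LC_ALL=C sort
}

# Usage on a live cluster (the cutoff is a placeholder, use your own
# upgrade time); -h suppresses the squeue header line:
# squeue -h -t running -O StartTime,JobID | jobs_started_before 2022-05-05T10:00:00
```

If this prints nothing, no running job predates the cutoff, which would 
suggest that the old slurmstepd processes are gone.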

I hope this helps.

/Ole

[1] https://en.wikipedia.org/wiki/ISO_8601

> On 05.05.22 13:53, Ole Holm Nielsen wrote:
>> Just a heads-up regarding setting 
>> CommunicationParameters=block_null_hash in slurm.conf:
>>
>> On 5/4/22 21:50, Tim Wickberg wrote:
>>> CVE-2022-29500:
>>>
>>> An architectural flaw with how credentials are handled can be exploited 
>>> to allow an unprivileged user to impersonate the SlurmUser account. 
>>> Access to the SlurmUser account can be used to execute arbitrary 
>>> processes as root.
>>>
>>> This issue impacts all Slurm releases since at least Slurm 1.0.0.
>>>
>>> Systems remain vulnerable until all slurmdbd, slurmctld, and slurmd 
>>> processes have been restarted in the cluster.
>>>
>>> Once all daemons have been upgraded sites are encouraged to add 
>>> "block_null_hash" to CommunicationParameters. That new option provides 
>>> additional protection against a potential exploit.
>>
>> The block_null_hash option still needs to be documented in the slurm.conf 
>> man-page.  But in https://bugs.schedmd.com/show_bug.cgi?id=14002 I was 
>> assured that it's OK to use it now.
>>
>> I upgraded 21.08.7 to 21.08.8 using RPM packages while the cluster was 
>> running production jobs.  This is perhaps not recommended (see 
>> https://slurm.schedmd.com/quickstart_admin.html#upgrade), but it worked 
>> without a glitch in this case too.
>>
>> However, when I defined CommunicationParameters=block_null_hash in 
>> slurm.conf later today, I started getting RPC errors on the compute 
>> nodes and in slurmctld when jobs were completing, see bug 14002.
>>
>> I would recommend that sites hold off a bit on setting 
>> CommunicationParameters=block_null_hash until we have found a resolution 
>> in bug 14002.  Draining all jobs from the cluster before setting this 
>> parameter may be the safe approach(?).