[slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Oct 28 06:59:38 UTC 2022


On 10/28/22 08:30, Richard Chang wrote:
> Yes, the system is an HPE Cray EX, and I am trying to use
> switch/hpe_slingshot.

I see that Slurm 22.05 has added support for "switch/hpe_slingshot" with 
HPE Slingshot systems:

 > SwitchType
 >     Identifies the type of switch or interconnect used for application
 >     communications. Acceptable values include "switch/cray_aries" for Cray
 >     systems, "switch/hpe_slingshot" for HPE Slingshot systems and
 >     "switch/none" for switches not requiring special processing for job
 >     launch or termination (Ethernet, and InfiniBand). The default value is
 >     "switch/none". All Slurm daemons, commands and running jobs must be
 >     restarted for a change in SwitchType to take effect. If running jobs
 >     exist at the time slurmctld is restarted with a new value of
 >     SwitchType, records of all jobs in any state may be lost.
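
Since the configured SwitchType has to be loadable on the node running
slurmctld, one way to check this before restarting the daemon is to look for
the matching plugin file there. The following is only a minimal sketch, not
anything shipped with Slurm: the helper names are hypothetical, and it assumes
that a plugin such as "switch/hpe_slingshot" is installed as
"switch_hpe_slingshot.so" in the directory that PluginDir points to. The
directory and the slurm.conf path differ between installations, so both are
left to the caller.

```python
import os
import re


def switch_type(conf_text):
    """Return the SwitchType value found in slurm.conf text.

    Falls back to "switch/none", the default named in the manual page.
    """
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments
        match = re.match(r"(?i)SwitchType\s*=\s*(\S+)", line)
        if match:
            return match.group(1)
    return "switch/none"


def plugin_file(plugin_dir, switch):
    """Build the expected plugin path for a SwitchType value.

    Assumption: "switch/hpe_slingshot" maps to "switch_hpe_slingshot.so".
    """
    return os.path.join(plugin_dir, switch.replace("/", "_") + ".so")


def plugin_installed(conf_text, plugin_dir):
    """Return True if the plugin for the configured SwitchType exists."""
    return os.path.isfile(plugin_file(plugin_dir, switch_type(conf_text)))
```

Calling plugin_installed() on the slurmctld node with the contents of its
slurm.conf and its PluginDir (for example as reported by "scontrol show
config") would show whether the plugin is missing there, which matches the
situation described in this thread.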

You probably need to contact your HPE support people.  A support contract 
with SchedMD is highly recommended when you have a complex setup with very 
new technology.  See https://www.schedmd.com/support.php

/Ole

> On 10/28/2022 11:21 AM, Ole Holm Nielsen wrote:
>> On 10/28/22 07:35, Richard Chang wrote:
>>> I have observed that when I specify a switch type in the slurm.conf
>>> file and that particular switch type is not present on the slurmctld
>>> node, slurmctld panics and shuts down. Is this expected? My slurmctld
>>> node doesn't have the switch type, but the compute nodes do.
>>> How can I set it up so that it can utilise the feature but not break
>>> Slurm?
>>
>> What is your line in slurm.conf?  The manual page seems to describe what
>> you have observed:
>>
>> SwitchType
>>     Identifies the type of switch or interconnect used for application
>>     communications. Acceptable values include "switch/cray_aries" for
>>     Cray systems and "switch/none" for switches not requiring special
>>     processing for job launch or termination (Ethernet, and InfiniBand).
>>     The default value is "switch/none". All Slurm daemons, commands and
>>     running jobs must be restarted for a change in SwitchType to take
>>     effect. If running jobs exist at the time slurmctld is restarted
>>     with a new value of SwitchType, records of all jobs in any state
>>     may be lost.
>>
>> Why do you want to use this configuration?  Is your system a Cray?


