[slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Oct 28 06:59:38 UTC 2022
On 10/28/22 08:30, Richard Chang wrote:
> Yes, the system is an HPE Cray EX, and I am trying to use
> switch/hpe_slingshot.
I see that Slurm 22.05 has added support for "switch/hpe_slingshot" with
HPE Slingshot systems:
> SwitchType
>     Identifies the type of switch or interconnect used for application
>     communications. Acceptable values include "switch/cray_aries" for Cray
>     systems, "switch/hpe_slingshot" for HPE Slingshot systems and
>     "switch/none" for switches not requiring special processing for job
>     launch or termination (Ethernet, and InfiniBand). The default value is
>     "switch/none". All Slurm daemons, commands and running jobs must be
>     restarted for a change in SwitchType to take effect. If running jobs
>     exist at the time slurmctld is restarted with a new value of
>     SwitchType, records of all jobs in any state may be lost.
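A quick sanity check on the slurmctld node is to see whether the plugin file is installed there at all. Slurm loads plugins as shared objects from its plugin directory, and as far as I know the file name is the SwitchType value with the slash replaced by an underscore plus ".so" (so switch_hpe_slingshot.so). The Python sketch below only assumes that naming; the function names are my own and not part of Slurm, and the directory you pass in must be the PluginDir of your installation.

```python
from pathlib import Path


def plugin_filename(switch_type: str) -> str:
    """Map a SwitchType value such as "switch/hpe_slingshot" to the
    shared-object name Slurm is assumed to look for."""
    kind, sep, name = switch_type.partition("/")
    if not sep or not kind or not name:
        raise ValueError(f"expected '<type>/<name>', got {switch_type!r}")
    return f"{kind}_{name}.so"


def plugin_installed(switch_type: str, plugin_dir: str) -> bool:
    """Return True if the plugin file exists in plugin_dir.

    plugin_dir is an assumption supplied by the caller: use the PluginDir
    value from your slurm.conf or your package's install path.
    """
    return (Path(plugin_dir) / plugin_filename(switch_type)).is_file()
```

For example, plugin_installed("switch/hpe_slingshot", "/usr/lib64/slurm") would tell you whether the file is present, where /usr/lib64/slurm is only a guess at a typical RPM install path. If it returns False on the slurmctld node but True on the compute nodes, the Slurm packages on the controller were most likely built without Slingshot support.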
You probably need to contact your HPE support people. A support contract
with SchedMD is highly recommended when you have a complex setup with very
new technology. See https://www.schedmd.com/support.php
/Ole
> On 10/28/2022 11:21 AM, Ole Holm Nielsen wrote:
>> On 10/28/22 07:35, Richard Chang wrote:
>>> I have observed that when I specify a switch type in the slurm.conf
>>> file and that particular switch type is not present on the slurmctld
>>> node, slurmctld panics and shuts down. Is this expected? My slurmctld
>>> node doesn't have the switch type, but the compute nodes do.
>>> How can I set it up so that it can utilise the feature but not break
>>> Slurm?
>>
>> What is your line in slurm.conf? The manual page seems to describe what
>> you have observed:
>>
>> SwitchType
>>     Identifies the type of switch or interconnect used for application
>>     communications. Acceptable values include "switch/cray_aries" for
>>     Cray systems and "switch/none" for switches not requiring special
>>     processing for job launch or termination (Ethernet, and InfiniBand).
>>     The default value is "switch/none". All Slurm daemons, commands and
>>     running jobs must be restarted for a change in SwitchType to take
>>     effect. If running jobs exist at the time slurmctld is restarted
>>     with a new value of SwitchType, records of all jobs in any state
>>     may be lost.
>>
>> Why do you want to use this configuration? Is your system a Cray?