[slurm-users] slurmctld doesn't start at boot

Diego Zuccato diego.zuccato at unibo.it
Fri Jul 23 11:00:59 UTC 2021


We answered in parallel :)
I usually prefer not to modify system-managed files, because package
updates could reset them. Since systemd allows overrides, I chose to use
those instead :)
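
For illustration, a minimal sketch of such an override, assuming the
goal is the one described in the quoted reply below (order slurmctld
after network-online.target). The path is the drop-in that
"systemctl edit slurmctld" creates by default; the exact directives are
an assumption, not a copy of any particular site's configuration:

```ini
# /etc/systemd/system/slurmctld.service.d/override.conf
# Drop-in: these dependencies are added to the ones in the packaged
# unit file, which itself stays untouched.
[Unit]
Wants=network-online.target
After=network-online.target
```

If the file is created by hand rather than with "systemctl edit", run
"systemctl daemon-reload" afterwards. Note that network-online.target
only delays startup when a wait-online service is enabled (with
NetworkManager, that is NetworkManager-wait-online.service). The same
kind of drop-in should apply to slurmd and slurmdbd.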

On 23/07/2021 12:52, Ole Holm Nielsen wrote:
> On 7/23/21 12:29 PM, Riccardo Sucapane wrote:
>> I am using Slurm as a workload manager on a system
>> with a master and 3 nodes.
>> The operating system is the recent Rocky Linux 8.4,
>> and Slurm is version 20.11.8, taken from the EPEL
>> repository.
>> Everything works correctly: once the system is up, the command
>> "systemctl start slurmctld" works fine, but at boot the slurmctld
>> daemon does not start on the master machine and reports a series of
>> errors.
>> Without quoting the whole slurmctld.log, the recurring error is the
>> following:
>>
>> [2021-07-23T09:58:01.932] error: get_addr_info: getaddrinfo() failed: 
>> Name or service not known
>> [2021-07-23T09:58:01.932] error: slurm_set_addr: Unable to resolve 
>> "blade01"
>> [2021-07-23T09:58:01.932] error: slurm_get_port: Address family '0' 
>> not supported
>> [2021-07-23T09:58:01.932] error: _set_slurmd_addr: failure on blade01
> 
> This seems to be a DNS name resolution error.
> 
> This could be due to slurmctld starting before the server's network is 
> completely up!  We have seen this with slurmd on EL 8.4 nodes, and I 
> found a solution, see 
> https://bugs.schedmd.com/show_bug.cgi?id=11878#c5.  This will be fixed 
> in Slurm 21.08.
> 
> In /usr/lib/systemd/system/slurmd.service and 
> /usr/lib/systemd/system/slurmctld.service you should replace 
> "network.target" with "network-online.target".  Reboot to test it.
> 
>> In this case I have set, for simplicity,
>> "AccountingStorageType=accounting_storage/none" in the slurm.conf file,
>> but slurmdbd/mariadb support also works with no problems; slurmctld
>> still does not start on boot.
>> Also, blade01 in the log above is the hostname of one of the nodes.
> 
> You should probably fix /usr/lib/systemd/system/slurmdbd.service as well.
> 
> /Ole
> 

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


