[slurm-users] slurmctld doesn't start at boot

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Jul 23 10:52:52 UTC 2021


On 7/23/21 12:29 PM, Riccardo Sucapane wrote:
> I am using Slurm as a workload manager on a system
> with a master and 3 nodes.
> The operating system is the recent Rocky Linux 8.4,
> and Slurm is version 20.11.8, taken from the EPEL
> repository.
> Everything works correctly, and once the system is up the command
> "systemctl start slurmctld" works fine, but at boot the slurmctld
> daemon does not start on the master machine and reports a series of errors.
> Without quoting the whole slurmctld.log, the recurring error is the following:
> 
> [2021-07-23T09:58:01.932] error: get_addr_info: getaddrinfo() failed: Name or service not known
> [2021-07-23T09:58:01.932] error: slurm_set_addr: Unable to resolve "blade01"
> [2021-07-23T09:58:01.932] error: slurm_get_port: Address family '0' not supported
> [2021-07-23T09:58:01.932] error: _set_slurmd_addr: failure on blade01

This seems to be a DNS name resolution error.

This could be due to slurmctld starting before the server's network is 
completely up!  We have seen this with slurmd on EL 8.4 nodes, and I found 
a solution, see https://bugs.schedmd.com/show_bug.cgi?id=11878#c5.  This 
will be fixed in Slurm 21.08.

In /usr/lib/systemd/system/slurmd.service and 
/usr/lib/systemd/system/slurmctld.service you should replace 
"network.target" with "network-online.target".  Then reboot to test it.

> In this case I have set, for simplicity,
> "AccountingStorageType=accounting_storage/none" in the slurm.conf file.
> With slurmdbd/MariaDB accounting everything also works without problems,
> but slurmctld still does not start at boot.
> Also, blade01 in the log above is the hostname of one of the nodes.

You should probably fix /usr/lib/systemd/system/slurmdbd.service as well.
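
Files under /usr/lib/systemd/system belong to the package and may be 
overwritten by an update, so the same change can also be made with a 
systemd drop-in.  The following is only a minimal sketch, not the exact 
patch from the bug report: the file name "network-online.conf" is an 
assumption (any *.conf name in the drop-in directory should work), and it 
adds the dependency instead of replacing "network.target".

```ini
# Hypothetical drop-in: /etc/systemd/system/slurmctld.service.d/network-online.conf
# (file name is an assumption; the directory name must match the unit name)
[Unit]
# Pull in network-online.target and order this unit after it.
# After= is additive in a drop-in, so the packaged After=network.target
# remains in effect alongside this line.
Wants=network-online.target
After=network-online.target
```

The same file could be placed in slurmd.service.d and slurmdbd.service.d, 
followed by "systemctl daemon-reload" and a reboot to test.  Note that 
network-online.target only delays startup if a wait-online service is 
enabled, for example NetworkManager-wait-online.service on a system 
managed by NetworkManager.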

/Ole


