[slurm-users] slurmctld doesn't start at boot
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Jul 23 10:52:52 UTC 2021
On 7/23/21 12:29 PM, Riccardo Sucapane wrote:
> I am using Slurm as a workload manager on a system
> with a master and 3 nodes.
> The operating system used is the recent rocky linux 8.4
> while for slurm, is used the version 20.11.8 taken from EPEL
> Everything works correctly and when the system is started the command
> "systemctl start slurmctld" works fine, but at boot the daemon
> slurmctld does not start on the master machine, reporting a series of errors.
> Without reporting all the slurmctld.log the recurring error is the following:
> [2021-07-23T09:58:01.932] error: get_addr_info: getaddrinfo() failed: Name
> or service not known
> [2021-07-23T09:58:01.932] error: slurm_set_addr: Unable to resolve "blade01"
> [2021-07-23T09:58:01.932] error: slurm_get_port: Address family '0' not
> [2021-07-23T09:58:01.932] error: _set_slurmd_addr: failure on blade01
This looks like a DNS name resolution error: slurmctld cannot resolve
the node name "blade01". It could be caused by slurmctld starting before
the server's network is completely up. We have seen this with slurmd on
EL 8.4 nodes, and I found
a solution, see https://bugs.schedmd.com/show_bug.cgi?id=11878#c5. This
will be fixed in Slurm 21.08.
In /usr/lib/systemd/system/slurmd.service and
/usr/lib/systemd/system/slurmctld.service you should replace
"network.target" with "network-online.target". Then reboot to test it.
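
As a sketch of an alternative that avoids editing the packaged unit files
(a package update may overwrite them), the same ordering can be added with
a systemd drop-in. The drop-in directory and file name below are
assumptions for illustration and are not taken from the bug report.
After= and Wants= given in a drop-in are added to the unit's existing
settings, so the unit still lists network.target but additionally waits
for network-online.target:

```ini
# /etc/systemd/system/slurmctld.service.d/network-online.conf
# (hypothetical file name; systemd reads any *.conf file in this directory)
[Unit]
# Pull in network-online.target and order this unit after it
Wants=network-online.target
After=network-online.target
```

Run "systemctl daemon-reload" after creating the file, and use the
corresponding slurmd.service.d directory on the compute nodes. Note that
network-online.target only delays startup if a wait-online service is
enabled, for example NetworkManager-wait-online.service on systems that
use NetworkManager.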
> In this case I have set it in the slurm.conf file, for simplicity,
> "AccountingStorageType=accounting_storage/none", but also using the
> slurmdbd/mariadb support is all right with no problems, but slurmctld
> still does not start on boot.
> Also in the log reported blade01 is the hostname of one of the nodes.
You should probably make the same change in
/usr/lib/systemd/system/slurmdbd.service as well.