[slurm-users] slurmctld daemon error

Luke Yeager lyeager at nvidia.com
Tue Dec 15 01:03:45 UTC 2020


What does your ‘slurmctld.service’ look like? You might want to add something to the ‘After=’ section if your service is starting too quickly.

e.g. we use ‘After=network.target munge.service’ (see <https://github.com/NVIDIA/nephele-packages/blob/30bc321c311398cc7a86485bc88930e4b6790fb4/slurm/debian/PACKAGE-control.slurmctld.service#L3>).
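
For example, a drop-in along these lines should do it. Just a sketch: the override.conf path matches the Drop-In shown in your status output below, and using ‘network-online.target’ instead of ‘network.target’ is my assumption here, since it waits for the network to actually be up (which matters if the resolver isn't ready yet at boot):

# /etc/systemd/system/slurmctld.service.d/override.conf
[Unit]
After=network-online.target munge.service
Wants=network-online.target

After editing, run ‘systemctl daemon-reload’, check the merged unit with ‘systemctl cat slurmctld.service’, and reboot to test.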

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Alpha Experiment
Sent: Monday, December 14, 2020 4:20 PM
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] slurmctld daemon error

Hi,

I am trying to run Slurm on Fedora 33. Upon boot, the slurmd daemon runs correctly; however, the slurmctld daemon always fails.
[admin at localhost ~]$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
     Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
   Main PID: 2363 (slurmd)
      Tasks: 2
     Memory: 3.4M
        CPU: 211ms
     CGroup: /system.slice/slurmd.service
             └─2363 /usr/local/sbin/slurmd -D
Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.
[admin at localhost ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
    Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 1972 (code=exited, status=1/FAILURE)
        CPU: 21ms
Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.

The slurmctld log is as follows:
[2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
[2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
[2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
[2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
[2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
[2020-12-14T16:02:12.772] Recovered state of 1 nodes
[2020-12-14T16:02:12.772] Recovered information about 0 jobs
[2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2020-12-14T16:02:12.779] Recovered state of 0 reservations
[2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
[2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2020-12-14T16:02:12.779] Running as primary controller
[2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
[2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
[2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
[2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
[2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
[2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol

Strangely, the daemon works fine when it is restarted. After running
systemctl restart slurmctld.service

the service status is
[admin at localhost ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
   Main PID: 2815 (slurmctld)
      Tasks: 7
     Memory: 1.9M
        CPU: 15ms
     CGroup: /system.slice/slurmctld.service
             └─2815 /usr/local/sbin/slurmctld -D
Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.

Could anyone point me towards how to fix this? I expect it's just an issue with my configuration file, which I've copied below for reference.
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=localhost
ControlMachine=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/home/slurm/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/home/slurm/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd/
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/home/slurm/spool/
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=info
SlurmctldLogFile=/home/slurm/log/slurmctld.log
#SlurmdDebug=info
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP

Thanks!
-John