[slurm-users] slurmctld daemon error

Avery Grieve agrieve at umich.edu
Tue Dec 15 02:07:18 UTC 2020


Hey Luke, I'm hitting the same issue with my slurmctld daemon not starting
on boot (and my slurmd daemon fails too). Both fail with the same messages
John got above (just an exit code, nothing more informative).

My slurmctld service file in /etc/systemd/system/ looks like this:

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Like John, the daemon starts fine if I just run the systemctl start command
manually after boot.
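
Since my unit already orders after network.target and still fails, my working
theory is that name resolution (rather than the network device) isn't ready
yet, which would match the getaddrinfo() errors in John's log. I'm going to
try ordering the service after network-online.target instead, via a drop-in
override. This is just a sketch of what I plan to test, not something I've
confirmed fixes it:

# /etc/systemd/system/slurmctld.service.d/override.conf
[Unit]
# Wait for the network to be reported online (not merely configured)
# before starting the controller, so getaddrinfo() has a chance to work.
Wants=network-online.target
After=network-online.target munge.service

followed by systemctl daemon-reload and a reboot to test. Note that
network-online.target only delays boot meaningfully if a wait service such as
NetworkManager-wait-online.service is enabled, so that part is an assumption
about the setup.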

~Avery Grieve
They/Them/Theirs please!
University of Michigan


On Mon, Dec 14, 2020 at 8:06 PM Luke Yeager <lyeager at nvidia.com> wrote:

> What does your ‘slurmctld.service’ look like? You might want to add
> something to the ‘After=’ section if your service is starting too quickly.
>
>
>
> e.g. we use ‘After=network.target munge.service’ (see here
> <https://github.com/NVIDIA/nephele-packages/blob/30bc321c311398cc7a86485bc88930e4b6790fb4/slurm/debian/PACKAGE-control.slurmctld.service#L3>).
>
>
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Alpha Experiment
> Sent: Monday, December 14, 2020 4:20 PM
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] slurmctld daemon error
>
> External email: Use caution opening links or attachments
>
>
>
> Hi,
>
>
>
> I am trying to run Slurm on Fedora 33. On boot, the slurmd daemon comes up
> correctly; however, the slurmctld daemon always fails.
>
> [admin at localhost ~]$ systemctl status slurmd.service
> ● slurmd.service - Slurm node daemon
>      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor
> preset: disabled)
>      Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
>    Main PID: 2363 (slurmd)
>       Tasks: 2
>      Memory: 3.4M
>         CPU: 211ms
>      CGroup: /system.slice/slurmd.service
>              └─2363 /usr/local/sbin/slurmd -D
> Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node
> daemon.
>
> [admin at localhost ~]$ systemctl status slurmctld.service
> ● slurmctld.service - Slurm controller daemon
>      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> vendor preset: disabled)
>     Drop-In: /etc/systemd/system/slurmctld.service.d
>              └─override.conf
>      Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST;
> 11min ago
>     Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D
> $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
>    Main PID: 1972 (code=exited, status=1/FAILURE)
>         CPU: 21ms
> Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller
> daemon.
> Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main
> process exited, code=exited, status=1/FAILURE
> Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service:
> Failed with result 'exit-code'.
>
>
>
> The slurmctld log is as follows:
>
> [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster
> cluster
> [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
> [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name
> or service not known
> [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve
> "localhost"
> [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not
> supported
> [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
> [2020-12-14T16:02:12.772] Recovered state of 1 nodes
> [2020-12-14T16:02:12.772] Recovered information about 0 jobs
> [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array:
> select/cons_tres: preparing for 1 partitions
> [2020-12-14T16:02:12.779] Recovered state of 0 reservations
> [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
> [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure:
> select/cons_tres: reconfigure
> [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array:
> select/cons_tres: preparing for 1 partitions
> [2020-12-14T16:02:12.779] Running as primary controller
> [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
> [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
> [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name
> or service not known
> [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
> [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port
> without address family
> [2020-12-14T16:02:12.782] error: Error creating slurm stream socket:
> Address family not supported by protocol
>
> [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address
> family not supported by protocol
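
Every failure in the log above is a name-resolution error, so if reordering
the unit doesn't help, another thing worth trying (untested, just a sketch on
my part) is making the service wait until "localhost" actually resolves before
slurmctld starts, e.g. in the [Service] section or a drop-in:

# Block startup until the controller hostname resolves; systemd's start
# timeout (TimeoutStartSec, 90 seconds by default) keeps this from hanging forever.
ExecStartPre=/bin/sh -c 'until getent hosts localhost; do sleep 1; done'

That is a workaround rather than a fix, but it would at least confirm that the
problem is startup ordering and not the resolver configuration itself.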
>
>
>
> Strangely, the daemon works fine once it is restarted manually. After running
>
> systemctl restart slurmctld.service
>
>
>
> the service status is
>
> [admin at localhost ~]$ systemctl status slurmctld.service
>
> ● slurmctld.service - Slurm controller daemon
>      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> vendor preset: disabled)
>     Drop-In: /etc/systemd/system/slurmctld.service.d
>              └─override.conf
>      Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
>    Main PID: 2815 (slurmctld)
>       Tasks: 7
>      Memory: 1.9M
>         CPU: 15ms
>      CGroup: /system.slice/slurmctld.service
>              └─2815 /usr/local/sbin/slurmctld -D
> Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller
> daemon.
>
>
>
> Could anyone point me towards how to fix this? I expect it's just an issue
> with my configuration file, which I've copied below for reference.
>
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> #SlurmctldHost=localhost
> ControlMachine=localhost
> #
> #MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/home/slurm/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/home/slurm/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurm/slurmd/
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/home/slurm/spool/
> SwitchType=switch/none
> TaskPlugin=task/affinity
> #
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> #
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cluster
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> #SlurmctldDebug=info
> SlurmctldLogFile=/home/slurm/log/slurmctld.log
> #SlurmdDebug=info
> #SlurmdLogFile=
> #
> #
> # COMPUTE NODES
> NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64
> ThreadsPerCore=2 State=UNKNOWN
> PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
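
One more thought on the config itself: because the errors are all about
resolving the controller/node name, it might help to use the newer
SlurmctldHost form and pin explicit addresses so slurmctld doesn't have to hit
the resolver at boot. The hostname(address) syntax and NodeAddr are described
in the slurm.conf man page; the 127.0.0.1 values below are just my assumption
for a single-node setup, and I haven't verified that this avoids the
getaddrinfo error:

SlurmctldHost=localhost(127.0.0.1)
NodeName=localhost NodeAddr=127.0.0.1 CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN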
>
>
>
> Thanks!
>
> -John
>

