[slurm-users] slurmctld daemon error

Alpha Experiment projectalpha137 at gmail.com
Tue Dec 15 07:01:19 UTC 2020


Hi Luke and Avery,

I changed the After= line in the slurmctld.service file to
After=network.target munge.service slurmd.service

This seemed to do the trick!
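
For anyone who finds this later: the same change can also live in a
systemd drop-in instead of the unit file itself (a minimal sketch,
assuming an override at
/etc/systemd/system/slurmctld.service.d/override.conf, as in the status
output quoted below):

[Unit]
After=network.target munge.service slurmd.service

After= accumulates across drop-ins, so listing only the extra
slurmd.service dependency would work too; either way, run
"systemctl daemon-reload" afterwards so systemd picks up the change.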

Best,
John

On Mon, Dec 14, 2020 at 6:10 PM Avery Grieve <agrieve at umich.edu> wrote:

> Hey Luke, I'm hitting the same issue: my slurmctld daemon doesn't start
> on boot (and neither does my slurmd daemon). Both fail with the same
> messages John posted above (just the 'exit-code' result, nothing more
> specific).
>
> My slurmctld service file in /etc/systemd/system/ looks like this:
>
> [Unit]
> Description=Slurm controller daemon
> After=network.target munge.service
> ConditionPathExists=/etc/slurm-llnl/slurm.conf
>
> [Service]
> Type=simple
> EnvironmentFile=-/etc/default/slurmctld
> ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS
> ExecReload=/bin/kill -HUP $MAINPID
> LimitNOFILE=65536
>
> [Install]
> WantedBy=multi-user.target
>
> Like John, my daemon starts fine if I run the systemctl start command
> manually after boot.
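>
> (For anyone testing ordering changes: a drop-in keeps them out of the
> unit file itself. A minimal sketch, assuming a systemd new enough to
> ship 'systemctl edit':
>
> sudo systemctl edit slurmctld.service
>
> then, in the editor that opens, add the extra ordering dependency:
>
> [Unit]
> After=munge.service
>
> List-type options such as After= accumulate across drop-ins, so only
> the additional unit needs to be listed. A reboot then verifies the
> ordering.)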
>
> ~Avery Grieve
> They/Them/Theirs please!
> University of Michigan
>
>
> On Mon, Dec 14, 2020 at 8:06 PM Luke Yeager <lyeager at nvidia.com> wrote:
>
>> What does your ‘slurmctld.service’ look like? You might want to add
>> something to the ‘After=’ line if your service is starting too early in
>> the boot sequence.
>>
>>
>>
>> e.g. we use ‘After=network.target munge.service’ (see here
>> <https://github.com/NVIDIA/nephele-packages/blob/30bc321c311398cc7a86485bc88930e4b6790fb4/slurm/debian/PACKAGE-control.slurmctld.service#L3>).
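>>
>> If the root cause turns out to be name resolution not being ready
>> (rather than the link coming up), the ordering the systemd
>> documentation prescribes for services that need a working network is
>> network-online.target (a variant worth trying; not what our packages
>> ship):
>>
>> [Unit]
>> After=network-online.target munge.service
>> Wants=network-online.target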
>>
>>
>>
>>
>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Alpha Experiment
>> Sent: Monday, December 14, 2020 4:20 PM
>> To: slurm-users at lists.schedmd.com
>> Subject: [slurm-users] slurmctld daemon error
>>
>> Hi,
>>
>>
>>
>> I am trying to run Slurm on Fedora 33. On boot the slurmd daemon comes
>> up correctly; however, the slurmctld daemon always fails.
>>
>> [admin at localhost ~]$ systemctl status slurmd.service
>> ● slurmd.service - Slurm node daemon
>>      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>      Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
>>    Main PID: 2363 (slurmd)
>>       Tasks: 2
>>      Memory: 3.4M
>>         CPU: 211ms
>>      CGroup: /system.slice/slurmd.service
>>              └─2363 /usr/local/sbin/slurmd -D
>>
>> Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.
>>
>> [admin at localhost ~]$ systemctl status slurmctld.service
>> ● slurmctld.service - Slurm controller daemon
>>      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>>     Drop-In: /etc/systemd/system/slurmctld.service.d
>>              └─override.conf
>>      Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
>>     Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
>>    Main PID: 1972 (code=exited, status=1/FAILURE)
>>         CPU: 21ms
>>
>> Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
>> Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>> Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>>
>>
>>
>> The slurmctld log is as follows:
>>
>> [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
>> [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
>> [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
>> [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
>> [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
>> [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
>> [2020-12-14T16:02:12.772] Recovered state of 1 nodes
>> [2020-12-14T16:02:12.772] Recovered information about 0 jobs
>> [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>> [2020-12-14T16:02:12.779] Recovered state of 0 reservations
>> [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
>> [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
>> [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>> [2020-12-14T16:02:12.779] Running as primary controller
>> [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
>> [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
>> [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
>> [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
>> [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
>> [2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
>> [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
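>>
>> (The getaddrinfo() failures suggest "localhost" is not resolvable at
>> that point in boot. A quick sanity check once the system is up, as a
>> sketch:
>>
>> getent hosts localhost
>>
>> If that resolves fine after boot, the problem is start-up ordering
>> rather than the name itself.)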
>>
>>
>>
>> Strangely, the daemon works fine when it is restarted by hand. After running
>>
>> systemctl restart slurmctld.service
>>
>>
>>
>> the service status is
>>
>> [admin at localhost ~]$ systemctl status slurmctld.service
>> ● slurmctld.service - Slurm controller daemon
>>      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>>     Drop-In: /etc/systemd/system/slurmctld.service.d
>>              └─override.conf
>>      Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
>>    Main PID: 2815 (slurmctld)
>>       Tasks: 7
>>      Memory: 1.9M
>>         CPU: 15ms
>>      CGroup: /system.slice/slurmctld.service
>>              └─2815 /usr/local/sbin/slurmctld -D
>>
>> Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.
>>
>>
>>
>> Could anyone point me towards how to fix this? I expect it's just an
>> issue with my configuration file, which I've copied below for reference.
>>
>> # slurm.conf file generated by configurator easy.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> #SlurmctldHost=localhost
>> ControlMachine=localhost
>> #
>> #MailProg=/bin/mail
>> MpiDefault=none
>> #MpiParams=ports=#-#
>> ProctrackType=proctrack/cgroup
>> ReturnToService=1
>> SlurmctldPidFile=/home/slurm/run/slurmctld.pid
>> #SlurmctldPort=6817
>> SlurmdPidFile=/home/slurm/run/slurmd.pid
>> #SlurmdPort=6818
>> SlurmdSpoolDir=/var/spool/slurm/slurmd/
>> SlurmUser=slurm
>> #SlurmdUser=root
>> StateSaveLocation=/home/slurm/spool/
>> SwitchType=switch/none
>> TaskPlugin=task/affinity
>> #
>> #
>> # TIMERS
>> #KillWait=30
>> #MinJobAge=300
>> #SlurmctldTimeout=120
>> #SlurmdTimeout=300
>> #
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core
>> #
>> #
>> # LOGGING AND ACCOUNTING
>> AccountingStorageType=accounting_storage/none
>> ClusterName=cluster
>> #JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> #SlurmctldDebug=info
>> SlurmctldLogFile=/home/slurm/log/slurmctld.log
>> #SlurmdDebug=info
>> #SlurmdLogFile=
>> #
>> #
>> # COMPUTE NODES
>> NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
>> PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
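>>
>> (One aside on the config: ControlMachine is the older spelling; the
>> commented-out SlurmctldHost line is the current form, e.g.:
>>
>> SlurmctldHost=localhost
>>
>> Both point at the same host here, so that is cosmetic rather than the
>> cause of the failure.)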
>>
>>
>>
>> Thanks!
>>
>> -John
>>
>