[slurm-users] slurmctld daemon error
Avery Grieve
agrieve at umich.edu
Tue Dec 15 16:53:21 UTC 2020
Maybe a silly question, but where do you find the daemon logs or specify
their location?
~Avery Grieve
They/Them/Theirs please!
University of Michigan
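Not a silly question: slurmd and slurmctld log wherever SlurmdLogFile and SlurmctldLogFile point in slurm.conf; if those keys are unset, messages go to syslog instead (so `journalctl -u slurmctld` on a systemd machine). On a live cluster you can also run `scontrol show config | grep -i logfile`, or grep the config file directly. A small sketch against a sample config (the /tmp path and its contents are illustrative only; a real config commonly lives at /etc/slurm/slurm.conf, though the path varies by install):

```shell
# Build a sample slurm.conf fragment to grep against.
cat > /tmp/slurm.conf.sample <<'EOF'
SlurmctldLogFile=/home/slurm/log/slurmctld.log
#SlurmdLogFile=
EOF

# Pull out the log-file settings; a commented-out key means "use syslog".
grep -i 'logfile' /tmp/slurm.conf.sample
```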
On Mon, Dec 14, 2020 at 7:22 PM Alpha Experiment <projectalpha137 at gmail.com>
wrote:
> Hi,
>
> I am trying to run slurm on Fedora 33. Upon boot the slurmd daemon is
> running correctly; however the slurmctld daemon always errors.
> [admin at localhost ~]$ systemctl status slurmd.service
> ● slurmd.service - Slurm node daemon
>      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>      Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
>    Main PID: 2363 (slurmd)
>       Tasks: 2
>      Memory: 3.4M
>         CPU: 211ms
>      CGroup: /system.slice/slurmd.service
>              └─2363 /usr/local/sbin/slurmd -D
>
> Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.
> [admin at localhost ~]$ systemctl status slurmctld.service
> ● slurmctld.service - Slurm controller daemon
>      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>     Drop-In: /etc/systemd/system/slurmctld.service.d
>              └─override.conf
>      Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
>     Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
>    Main PID: 1972 (code=exited, status=1/FAILURE)
>         CPU: 21ms
>
> Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
> Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
> Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>
> The slurmctld log is as follows:
> [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
> [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
> [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
> [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
> [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
> [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
> [2020-12-14T16:02:12.772] Recovered state of 1 nodes
> [2020-12-14T16:02:12.772] Recovered information about 0 jobs
> [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> [2020-12-14T16:02:12.779] Recovered state of 0 reservations
> [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
> [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
> [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> [2020-12-14T16:02:12.779] Running as primary controller
> [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
> [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
> [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
> [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
> [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
> [2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
> [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
>
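The pattern in the log above — getaddrinfo() failing at boot while the same binary starts cleanly on a later manual restart — usually means slurmctld was launched before name resolution was usable. One common mitigation (a sketch only, not verified against this machine; the drop-in path matches the override.conf already visible in the unit status) is to order the unit after network-online:

```ini
# /etc/systemd/system/slurmctld.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
```

Run `systemctl daemon-reload` afterwards so the drop-in takes effect on the next boot.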
> Strangely, the daemon works fine when it is restarted. After running
> systemctl restart slurmctld.service
>
> the service status is
> [admin at localhost ~]$ systemctl status slurmctld.service
> ● slurmctld.service - Slurm controller daemon
>      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>     Drop-In: /etc/systemd/system/slurmctld.service.d
>              └─override.conf
>      Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
>    Main PID: 2815 (slurmctld)
>       Tasks: 7
>      Memory: 1.9M
>         CPU: 15ms
>      CGroup: /system.slice/slurmctld.service
>              └─2815 /usr/local/sbin/slurmctld -D
>
> Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.
>
> Could anyone point me towards how to fix this? I expect it's just an issue
> with my configuration file, which I've copied below for reference.
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> #SlurmctldHost=localhost
> ControlMachine=localhost
> #
> #MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/home/slurm/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/home/slurm/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurm/slurmd/
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/home/slurm/spool/
> SwitchType=switch/none
> TaskPlugin=task/affinity
> #
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> #
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cluster
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> #SlurmctldDebug=info
> SlurmctldLogFile=/home/slurm/log/slurmctld.log
> #SlurmdDebug=info
> #SlurmdLogFile=
> #
> #
> # COMPUTE NODES
> NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
> PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
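The only names the controller has to resolve in the config above are ControlMachine=localhost and NodeName=localhost, and slurmctld looks them up with getaddrinfo() — exactly the call failing in the log. A quick sanity check from a shell (assuming a glibc system where getent is available):

```shell
# getent goes through the same NSS lookup path as getaddrinfo(); if this
# prints nothing, slurmctld's resolution of "localhost" will fail too.
getent hosts localhost
```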
>
> Thanks!
> -John
>