[slurm-users] slurmctld daemon error

Alpha Experiment projectalpha137 at gmail.com
Tue Dec 15 07:13:03 UTC 2020


Hi Brian,

My hosts file looks like this:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

I believe the second line is an IPv6 address. Is it safe to delete that line?
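
For concreteness, the change I'm considering is just commenting out the
IPv6 loopback line (assuming nothing else on the box needs the
localhost6 names):

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1        localhost localhost.localdomain localhost6 localhost6.localdomain6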

Best,
John


On Mon, Dec 14, 2020 at 11:10 PM Brian Andrus <toomuchit at gmail.com> wrote:
>
> Check your hosts file and ensure 'localhost' does not have an IPv6
> address associated with it.
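>
> A quick, Slurm-agnostic way to see what the resolver returns for it:
>
>     getent ahosts localhost
>
> If ::1 comes back there, that may be what slurmctld trips over at
> boot, before IPv6/name resolution is fully up.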
>
> Brian Andrus
>
> On 12/14/2020 4:19 PM, Alpha Experiment wrote:
> > Hi,
> >
> > I am trying to run Slurm on Fedora 33. Upon boot the slurmd daemon
> > runs correctly; however, the slurmctld daemon always fails.
> > [admin@localhost ~]$ systemctl status slurmd.service
> > ● slurmd.service - Slurm node daemon
> >      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled;
> > vendor preset: disabled)
> >      Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
> >    Main PID: 2363 (slurmd)
> >       Tasks: 2
> >      Memory: 3.4M
> >         CPU: 211ms
> >      CGroup: /system.slice/slurmd.service
> >              └─2363 /usr/local/sbin/slurmd -D
> > Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node
> > daemon.
> > [admin@localhost ~]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> > vendor preset: disabled)
> >     Drop-In: /etc/systemd/system/slurmctld.service.d
> >              └─override.conf
> >      Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12
> > PST; 11min ago
> >     Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D
> > $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
> >    Main PID: 1972 (code=exited, status=1/FAILURE)
> >         CPU: 21ms
> > Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm
> > controller daemon.
> > Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service:
> > Main process exited, code=exited, status=1/FAILURE
> > Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service:
> > Failed with result 'exit-code'.
> >
> > The slurmctld log is as follows:
> > [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster
> > cluster
> > [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
> > [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed:
> > Name or service not known
> > [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve
> > "localhost"
> > [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0'
> > not supported
> > [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
> > [2020-12-14T16:02:12.772] Recovered state of 1 nodes
> > [2020-12-14T16:02:12.772] Recovered information about 0 jobs
> > [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array:
> > select/cons_tres: preparing for 1 partitions
> > [2020-12-14T16:02:12.779] Recovered state of 0 reservations
> > [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
> > [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure:
> > select/cons_tres: reconfigure
> > [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array:
> > select/cons_tres: preparing for 1 partitions
> > [2020-12-14T16:02:12.779] Running as primary controller
> > [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
> > [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
> > [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed:
> > Name or service not known
> > [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve
> > "(null)"
> > [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set
> > port without address family
> > [2020-12-14T16:02:12.782] error: Error creating slurm stream socket:
> > Address family not supported by protocol
> > [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error
> > Address family not supported by protocol
> >
> > Strangely, the daemon works fine when it is restarted. After running
> > systemctl restart slurmctld.service
> >
> > the service status is
> > [admin@localhost ~]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> > vendor preset: disabled)
> >     Drop-In: /etc/systemd/system/slurmctld.service.d
> >              └─override.conf
> >      Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
> >    Main PID: 2815 (slurmctld)
> >       Tasks: 7
> >      Memory: 1.9M
> >         CPU: 15ms
> >      CGroup: /system.slice/slurmctld.service
> >              └─2815 /usr/local/sbin/slurmctld -D
> > Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm
> > controller daemon.
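> >
> > I did wonder whether this is a boot-ordering race (slurmctld starting
> > before name resolution is ready). If so, I assume adding something
> > like this to my existing override.conf drop-in would delay it until
> > the network is up, though I haven't verified that:
> >
> > # /etc/systemd/system/slurmctld.service.d/override.conf
> > [Unit]
> > Wants=network-online.target
> > After=network-online.target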
> >
> > Could anyone point me towards how to fix this? I expect it's just an
> > issue with my configuration file, which I've copied below for reference.
> > # slurm.conf file generated by configurator easy.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > #
> > #SlurmctldHost=localhost
> > ControlMachine=localhost
> > #
> > #MailProg=/bin/mail
> > MpiDefault=none
> > #MpiParams=ports=#-#
> > ProctrackType=proctrack/cgroup
> > ReturnToService=1
> > SlurmctldPidFile=/home/slurm/run/slurmctld.pid
> > #SlurmctldPort=6817
> > SlurmdPidFile=/home/slurm/run/slurmd.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/var/spool/slurm/slurmd/
> > SlurmUser=slurm
> > #SlurmdUser=root
> > StateSaveLocation=/home/slurm/spool/
> > SwitchType=switch/none
> > TaskPlugin=task/affinity
> > #
> > #
> > # TIMERS
> > #KillWait=30
> > #MinJobAge=300
> > #SlurmctldTimeout=120
> > #SlurmdTimeout=300
> > #
> > #
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > SelectType=select/cons_tres
> > SelectTypeParameters=CR_Core
> > #
> > #
> > # LOGGING AND ACCOUNTING
> > AccountingStorageType=accounting_storage/none
> > ClusterName=cluster
> > #JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/none
> > #SlurmctldDebug=info
> > SlurmctldLogFile=/home/slurm/log/slurmctld.log
> > #SlurmdDebug=info
> > #SlurmdLogFile=
> > #
> > #
> > # COMPUTE NODES
> > NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1
> > CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
> > PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
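> >
> > (Aside: the configurator left ControlMachine in; I gather
> > SlurmctldHost=localhost is the newer spelling of the same setting,
> > and I assume either works here.)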
> >
> > Thanks!
> > -John