[slurm-users] slurmctld daemon error
mercan
ahmet.mercan at uhem.itu.edu.tr
Tue Dec 15 19:09:47 UTC 2020
Oh, yes! Sorry, I confused your email with Alpha Experiment's emails.
Ahmet M.
On 15.12.2020 21:59, Avery Grieve wrote:
> Hi Ahmet,
>
> Thank you for your suggestion. I assume you're talking about the
> SlurmctldHost field in the slurm.conf file? If so, I've got that
> variable defined as a hostname, not localhost.
>
> Thanks,
> Avery
>
> On Tue, Dec 15, 2020, 1:51 PM mercan <ahmet.mercan at uhem.itu.edu.tr> wrote:
>
> Hi;
>
> I don't know if this is the problem, but I think setting
> "ControlMachine=localhost" and not setting a hostname for the slurm
> master node are not good decisions. How would the compute nodes work
> out the IP address of the slurm master node from "localhost"? Also, I
> suggest not using capital letters for anything related to slurm.
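>
> For example, something like the following in slurm.conf, where the
> hostname and address are only placeholders for your own controller
> node:
>
>     SlurmctldHost=firemaster(192.168.1.10)
>
> and then make sure every compute node can resolve that hostname, for
> example through /etc/hosts.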
>
> Ahmet M.
>
>
> On 15.12.2020 21:15, Avery Grieve wrote:
> > I changed my .service file to write to a log. The slurm daemons are
> > running (manual start) on the compute nodes. I get this on startup
> > with the service enabled:
> >
> > [2020-12-15T18:09:06.412] slurmctld version 20.11.1 started on cluster cluster
> > [2020-12-15T18:09:06.539] No memory enforcing mechanism configured.
> > [2020-12-15T18:09:06.572] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode1"
> > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
> > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode1
> > [2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode2"
> > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
> > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode2
> > [2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode3"
> > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
> > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode3
> > [2020-12-15T18:09:06.578] Recovered state of 3 nodes
> > [2020-12-15T18:09:06.579] Recovered information about 0 jobs
> > [2020-12-15T18:09:06.582] Recovered state of 0 reservations
> > [2020-12-15T18:09:06.582] read_slurm_conf: backup_controller not specified
> > [2020-12-15T18:09:06.583] Running as primary controller
> > [2020-12-15T18:09:06.592] No parameter for mcs plugin, default values set
> > [2020-12-15T18:09:06.592] mcs: MCSParameters = (null). ondemand set.
> > [2020-12-15T18:09:06.595] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-15T18:09:06.595] error: slurm_set_addr: Unable to resolve "(null)"
> > [2020-12-15T18:09:06.595] error: slurm_set_port: attempting to set port without address family
> > [2020-12-15T18:09:06.603] error: Error creating slurm stream socket: Address family not supported by protocol
> > [2020-12-15T18:09:06.603] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
> >
> > The main errors seem to be issues resolving host names and not being
> > able to set the port. My /etc/hosts file defines the FireNode[1-3]
> > host IPs and does not contain any IPv6 addresses.
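> >
> > For reference, the relevant /etc/hosts entries look roughly like this
> > (the addresses below are placeholders, not my real ones):
> >
> >     192.168.1.11  FireNode1
> >     192.168.1.12  FireNode2
> >     192.168.1.13  FireNode3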
> >
> > My service file includes a clause for "after network-online.target"
> > as well.
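> >
> > The relevant part of that unit file is roughly this (a sketch from
> > memory, so the exact path and wording may differ):
> >
> >     [Unit]
> >     After=network-online.target
> >
> > I'm not sure whether After= on its own is enough here, or whether the
> > unit also needs Wants=network-online.target to actually pull that
> > target in at boot.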
> >
> > Now, I start the daemon with "systemctl start slurmctld" and end up
> > with the following log:
> >
> > [2020-12-15T18:14:03.448] slurmctld version 20.11.1 started on cluster cluster
> > [2020-12-15T18:14:03.456] No memory enforcing mechanism configured.
> > [2020-12-15T18:14:03.465] Recovered state of 3 nodes
> > [2020-12-15T18:14:03.465] Recovered information about 0 jobs
> > [2020-12-15T18:14:03.465] Recovered state of 0 reservations
> > [2020-12-15T18:14:03.466] read_slurm_conf: backup_controller not specified
> > [2020-12-15T18:14:03.466] Running as primary controller
> > [2020-12-15T18:14:03.466] No parameter for mcs plugin, default values set
> > [2020-12-15T18:14:03.466] mcs: MCSParameters = (null). ondemand set.
> >
> > As you can see, it starts up fine. It seems like something goes wrong
> > with the network stack configuration during the initial startup.
> > I'm not really sure where to look to begin troubleshooting these. A
> > bit of googling hasn't revealed much either unfortunately.
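> >
> > Is there something along these lines that would tell me whether name
> > resolution is actually up by the time slurmctld starts? These are just
> > the checks I can think of:
> >
> >     getent hosts FireNode1        # does libc name resolution work at this point?
> >     journalctl -b -u slurmctld    # messages from the failed start this boot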
> >
> > Any advice?
> >
> > ~Avery Grieve
> > They/Them/Theirs please!
> > University of Michigan
> >
> >
> > On Tue, Dec 15, 2020 at 11:53 AM Avery Grieve <agrieve at umich.edu> wrote:
> >
> > Maybe a silly question, but where do you find the daemon logs or
> > specify their location?
> >
> > ~Avery Grieve
> > They/Them/Theirs please!
> > University of Michigan
> >
> >
> > On Mon, Dec 14, 2020 at 7:22 PM Alpha Experiment <projectalpha137 at gmail.com> wrote:
> >
> > Hi,
> >
> > I am trying to run slurm on Fedora 33. Upon boot the slurmd
> > daemon is running correctly; however the slurmctld daemon
> > always errors.
> > [admin at localhost ~]$ systemctl status slurmd.service
> > ● slurmd.service - Slurm node daemon
> > Loaded: loaded (/etc/systemd/system/slurmd.service;
> > enabled; vendor preset: disabled)
> > Active: active (running) since Mon 2020-12-14 16:02:18
> > PST; 11min ago
> > Main PID: 2363 (slurmd)
> > Tasks: 2
> > Memory: 3.4M
> > CPU: 211ms
> > CGroup: /system.slice/slurmd.service
> > └─2363 /usr/local/sbin/slurmd -D
> > Dec 14 16:02:18 localhost.localdomain systemd[1]: Started
> > Slurm node daemon.
> > [admin at localhost ~]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> > Loaded: loaded (/etc/systemd/system/slurmctld.service;
> > enabled; vendor preset: disabled)
> > Drop-In: /etc/systemd/system/slurmctld.service.d
> > └─override.conf
> > Active: failed (Result: exit-code) since Mon 2020-12-14
> > 16:02:12 PST; 11min ago
> > Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D
> > $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
> > Main PID: 1972 (code=exited, status=1/FAILURE)
> > CPU: 21ms
> > Dec 14 16:02:12 localhost.localdomain systemd[1]: Started
> > Slurm controller daemon.
> > Dec 14 16:02:12 localhost.localdomain systemd[1]:
> > slurmctld.service: Main process exited, code=exited,
> > status=1/FAILURE
> > Dec 14 16:02:12 localhost.localdomain systemd[1]:
> > slurmctld.service: Failed with result 'exit-code'.
> >
> > The slurmctld log is as follows:
> > [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
> > [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
> > [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
> > [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
> > [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
> > [2020-12-14T16:02:12.772] Recovered state of 1 nodes
> > [2020-12-14T16:02:12.772] Recovered information about 0 jobs
> > [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> > [2020-12-14T16:02:12.779] Recovered state of 0 reservations
> > [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
> > [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
> > [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> > [2020-12-14T16:02:12.779] Running as primary controller
> > [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
> > [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
> > [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
> > [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
> > [2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
> > [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
> >
> > Strangely, the daemon works fine when it is restarted. After running
> > systemctl restart slurmctld.service
> >
> > the service status is
> > [admin at localhost ~]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> > Loaded: loaded (/etc/systemd/system/slurmctld.service;
> > enabled; vendor preset: disabled)
> > Drop-In: /etc/systemd/system/slurmctld.service.d
> > └─override.conf
> > Active: active (running) since Mon 2020-12-14 16:14:24
> > PST; 3s ago
> > Main PID: 2815 (slurmctld)
> > Tasks: 7
> > Memory: 1.9M
> > CPU: 15ms
> > CGroup: /system.slice/slurmctld.service
> > └─2815 /usr/local/sbin/slurmctld -D
> > Dec 14 16:14:24 localhost.localdomain systemd[1]: Started
> > Slurm controller daemon.
> >
> > Could anyone point me towards how to fix this? I expect it's
> > just an issue with my configuration file, which I've copied
> > below for reference.
> > # slurm.conf file generated by configurator easy.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > #
> > #SlurmctldHost=localhost
> > ControlMachine=localhost
> > #
> > #MailProg=/bin/mail
> > MpiDefault=none
> > #MpiParams=ports=#-#
> > ProctrackType=proctrack/cgroup
> > ReturnToService=1
> > SlurmctldPidFile=/home/slurm/run/slurmctld.pid
> > #SlurmctldPort=6817
> > SlurmdPidFile=/home/slurm/run/slurmd.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/var/spool/slurm/slurmd/
> > SlurmUser=slurm
> > #SlurmdUser=root
> > StateSaveLocation=/home/slurm/spool/
> > SwitchType=switch/none
> > TaskPlugin=task/affinity
> > #
> > #
> > # TIMERS
> > #KillWait=30
> > #MinJobAge=300
> > #SlurmctldTimeout=120
> > #SlurmdTimeout=300
> > #
> > #
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > SelectType=select/cons_tres
> > SelectTypeParameters=CR_Core
> > #
> > #
> > # LOGGING AND ACCOUNTING
> > AccountingStorageType=accounting_storage/none
> > ClusterName=cluster
> > #JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/none
> > #SlurmctldDebug=info
> > SlurmctldLogFile=/home/slurm/log/slurmctld.log
> > #SlurmdDebug=info
> > #SlurmdLogFile=
> > #
> > #
> > # COMPUTE NODES
> > NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
> > PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
> >
> > Thanks!
> > -John
> >
>