[slurm-users] slurmctld daemon error
mercan
ahmet.mercan at uhem.itu.edu.tr
Tue Dec 15 19:09:47 UTC 2020
Oh, yes! Sorry, I confused your email with Alpha Experiment's emails.
Ahmet M.
On 15.12.2020 21:59, Avery Grieve wrote:
> Hi Ahmet,
>
> Thank you for your suggestion. I assume you're talking about the
> SlurmctldHost field in the slurm.conf file? If so, I've got that
> variable defined as a hostname, not localhost.
>
> Thanks,
> Avery
>
> On Tue, Dec 15, 2020, 1:51 PM mercan <ahmet.mercan at uhem.itu.edu.tr> wrote:
>
> Hi;
>
> I don't know if this is the problem, but I think setting
> "ControlMachine=localhost" and not setting a hostname for the slurm
> master node are not good decisions. How would the compute nodes work
> out the IP address of the slurm master node from "localhost"? Also, I
> suggest not using capital letters for anything related to slurm.
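>
> For example, something like the following in slurm.conf, where the
> hostname and address are only placeholders for your own controller
> node:
>
>     SlurmctldHost=firemaster(192.168.1.10)
>
> and then make sure every compute node can resolve that hostname, for
> example through /etc/hosts.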
>
> Ahmet M.
>
>
> On 15.12.2020 21:15, Avery Grieve wrote:
> > I changed my .service file to write to a log. The slurm daemons are
> > running (manual start) on the compute nodes. I get this on startup
> > with the service enabled:
> >
> > [2020-12-15T18:09:06.412] slurmctld version 20.11.1 started on cluster cluster
> > [2020-12-15T18:09:06.539] No memory enforcing mechanism configured.
> > [2020-12-15T18:09:06.572] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode1"
> > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
> > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode1
> > [2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode2"
> > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
> > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode2
> > [2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode3"
> > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
> > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode3
> > [2020-12-15T18:09:06.578] Recovered state of 3 nodes
> > [2020-12-15T18:09:06.579] Recovered information about 0 jobs
> > [2020-12-15T18:09:06.582] Recovered state of 0 reservations
> > [2020-12-15T18:09:06.582] read_slurm_conf: backup_controller not specified
> > [2020-12-15T18:09:06.583] Running as primary controller
> > [2020-12-15T18:09:06.592] No parameter for mcs plugin, default values set
> > [2020-12-15T18:09:06.592] mcs: MCSParameters = (null). ondemand set.
> > [2020-12-15T18:09:06.595] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-15T18:09:06.595] error: slurm_set_addr: Unable to resolve "(null)"
> > [2020-12-15T18:09:06.595] error: slurm_set_port: attempting to set port without address family
> > [2020-12-15T18:09:06.603] error: Error creating slurm stream socket: Address family not supported by protocol
> > [2020-12-15T18:09:06.603] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
> >
> > The main errors seem to be issues resolving host names and not being
> > able to set the port. My /etc/hosts file defines the FireNode[1-3]
> > host IPs and does not contain any IPv6 addresses.
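> >
> > For reference, the relevant /etc/hosts entries look roughly like this
> > (the addresses below are placeholders, not my real ones):
> >
> >     192.168.1.11  FireNode1
> >     192.168.1.12  FireNode2
> >     192.168.1.13  FireNode3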
> >
> > My service file includes a clause for "after network-online.target"
> > as well.
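> >
> > The relevant part of that unit file is roughly this (a sketch from
> > memory, so the exact path and wording may differ):
> >
> >     [Unit]
> >     After=network-online.target
> >
> > I'm not sure whether After= on its own is enough here, or whether the
> > unit also needs Wants=network-online.target to actually pull that
> > target in at boot.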
> >
> > Now, I start the daemon with "systemctl start slurmctld" and end up
> > with the following log:
> >
> > [2020-12-15T18:14:03.448] slurmctld version 20.11.1 started on cluster cluster
> > [2020-12-15T18:14:03.456] No memory enforcing mechanism configured.
> > [2020-12-15T18:14:03.465] Recovered state of 3 nodes
> > [2020-12-15T18:14:03.465] Recovered information about 0 jobs
> > [2020-12-15T18:14:03.465] Recovered state of 0 reservations
> > [2020-12-15T18:14:03.466] read_slurm_conf: backup_controller not specified
> > [2020-12-15T18:14:03.466] Running as primary controller
> > [2020-12-15T18:14:03.466] No parameter for mcs plugin, default values set
> > [2020-12-15T18:14:03.466] mcs: MCSParameters = (null). ondemand set.
> >
> > As you can see, it starts up fine. It seems like something goes wrong
> > with the network stack configuration during the initial startup.
> > I'm not really sure where to look to begin troubleshooting these. A
> > bit of googling hasn't revealed much either unfortunately.
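> >
> > Is there something along these lines that would tell me whether name
> > resolution is actually up by the time slurmctld starts? These are just
> > the checks I can think of:
> >
> >     getent hosts FireNode1        # does libc name resolution work at this point?
> >     journalctl -b -u slurmctld    # messages from the failed start this boot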
> >
> > Any advice?
> >
> > ~Avery Grieve
> > They/Them/Theirs please!
> > University of Michigan
> >
> >
> > On Tue, Dec 15, 2020 at 11:53 AM Avery Grieve <agrieve at umich.edu> wrote:
> >
> > Maybe a silly question, but where do you find the daemon logs or
> > specify their location?
> >
> > ~Avery Grieve
> > They/Them/Theirs please!
> > University of Michigan
> >
> >
> > On Mon, Dec 14, 2020 at 7:22 PM Alpha Experiment <projectalpha137 at gmail.com> wrote:
> >
> > Hi,
> >
> > I am trying to run slurm on Fedora 33. Upon boot the slurmd
> > daemon is running correctly; however the slurmctld daemon
> > always errors.
> > [admin at localhost ~]$ systemctl status slurmd.service
> > ● slurmd.service - Slurm node daemon
> > Loaded: loaded (/etc/systemd/system/slurmd.service;
> > enabled; vendor preset: disabled)
> > Active: active (running) since Mon 2020-12-14 16:02:18
> > PST; 11min ago
> > Main PID: 2363 (slurmd)
> > Tasks: 2
> > Memory: 3.4M
> > CPU: 211ms
> > CGroup: /system.slice/slurmd.service
> > └─2363 /usr/local/sbin/slurmd -D
> > Dec 14 16:02:18 localhost.localdomain systemd[1]: Started
> > Slurm node daemon.
> > [admin at localhost ~]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> > Loaded: loaded (/etc/systemd/system/slurmctld.service;
> > enabled; vendor preset: disabled)
> > Drop-In: /etc/systemd/system/slurmctld.service.d
> > └─override.conf
> > Active: failed (Result: exit-code) since Mon 2020-12-14
> > 16:02:12 PST; 11min ago
> > Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D
> > $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
> > Main PID: 1972 (code=exited, status=1/FAILURE)
> > CPU: 21ms
> > Dec 14 16:02:12 localhost.localdomain systemd[1]: Started
> > Slurm controller daemon.
> > Dec 14 16:02:12 localhost.localdomain systemd[1]:
> > slurmctld.service: Main process exited, code=exited,
> > status=1/FAILURE
> > Dec 14 16:02:12 localhost.localdomain systemd[1]:
> > slurmctld.service: Failed with result 'exit-code'.
> >
> > The slurmctld log is as follows:
> > [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
> > [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
> > [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
> > [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
> > [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
> > [2020-12-14T16:02:12.772] Recovered state of 1 nodes
> > [2020-12-14T16:02:12.772] Recovered information about 0 jobs
> > [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> > [2020-12-14T16:02:12.779] Recovered state of 0 reservations
> > [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
> > [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
> > [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> > [2020-12-14T16:02:12.779] Running as primary controller
> > [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
> > [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
> > [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
> > [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
> > [2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
> > [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
> >
> > Strangely, the daemon works fine when it is restarted. After running
> > systemctl restart slurmctld.service
> >
> > the service status is
> > [admin at localhost ~]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> > Loaded: loaded (/etc/systemd/system/slurmctld.service;
> > enabled; vendor preset: disabled)
> > Drop-In: /etc/systemd/system/slurmctld.service.d
> > └─override.conf
> > Active: active (running) since Mon 2020-12-14 16:14:24
> > PST; 3s ago
> > Main PID: 2815 (slurmctld)
> > Tasks: 7
> > Memory: 1.9M
> > CPU: 15ms
> > CGroup: /system.slice/slurmctld.service
> > └─2815 /usr/local/sbin/slurmctld -D
> > Dec 14 16:14:24 localhost.localdomain systemd[1]: Started
> > Slurm controller daemon.
> >
> > Could anyone point me towards how to fix this? I expect it's
> > just an issue with my configuration file, which I've copied
> > below for reference.
> > # slurm.conf file generated by configurator easy.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > #
> > #SlurmctldHost=localhost
> > ControlMachine=localhost
> > #
> > #MailProg=/bin/mail
> > MpiDefault=none
> > #MpiParams=ports=#-#
> > ProctrackType=proctrack/cgroup
> > ReturnToService=1
> > SlurmctldPidFile=/home/slurm/run/slurmctld.pid
> > #SlurmctldPort=6817
> > SlurmdPidFile=/home/slurm/run/slurmd.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/var/spool/slurm/slurmd/
> > SlurmUser=slurm
> > #SlurmdUser=root
> > StateSaveLocation=/home/slurm/spool/
> > SwitchType=switch/none
> > TaskPlugin=task/affinity
> > #
> > #
> > # TIMERS
> > #KillWait=30
> > #MinJobAge=300
> > #SlurmctldTimeout=120
> > #SlurmdTimeout=300
> > #
> > #
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > SelectType=select/cons_tres
> > SelectTypeParameters=CR_Core
> > #
> > #
> > # LOGGING AND ACCOUNTING
> > AccountingStorageType=accounting_storage/none
> > ClusterName=cluster
> > #JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/none
> > #SlurmctldDebug=info
> > SlurmctldLogFile=/home/slurm/log/slurmctld.log
> > #SlurmdDebug=info
> > #SlurmdLogFile=
> > #
> > #
> > # COMPUTE NODES
> > NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
> > PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
> >
> > Thanks!
> > -John
> >
>