<div dir="ltr"><div>Hey Luke, I'm hitting the same issue: my slurmctld daemon isn't starting on boot (and neither is my slurmd daemon). Both fail with the same messages John posted above (just the exit-code failure). <br></div><div><br></div><div>My slurmctld service file in /etc/systemd/system/ looks like this:<br></div><div><br></div><div style="margin-left:40px"><span style="font-family:monospace">[Unit]<br>Description=Slurm controller daemon<br>After=network.target munge.service<br>ConditionPathExists=/etc/slurm-llnl/slurm.conf<br><br>[Service]<br>Type=simple<br>EnvironmentFile=-/etc/default/slurmctld<br>ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS<br>ExecReload=/bin/kill -HUP $MAINPID<br>LimitNOFILE=65536<br><br>[Install]<br>WantedBy=multi-user.target</span><br></div><div><br></div><div>Like John, my daemon starts fine if I just run the systemctl start command manually after boot. <br></div><div><br></div><div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">~Avery Grieve</div><div>They/Them/Theirs please!<br></div><div dir="ltr"><div>University of Michigan</div></div></div></div></div></div></div></div></div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Dec 14, 2020 at 8:06 PM Luke Yeager <<a href="mailto:lyeager@nvidia.com">lyeager@nvidia.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div style="overflow-wrap: break-word;" lang="EN-US">
<div class="gmail-m_-7330438505544276869WordSection1">
<p class="MsoNormal">What does your ‘slurmctld.service’ look like? You might want to add something to the ‘After=’ line if your service is starting too early in boot.
<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">E.g., we use ‘After=network.target munge.service’ (<a href="https://github.com/NVIDIA/nephele-packages/blob/30bc321c311398cc7a86485bc88930e4b6790fb4/slurm/debian/PACKAGE-control.slurmctld.service#L3" target="_blank">see here</a>).
<u></u><u></u></p>
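For installs where the packaged unit file shouldn't be edited directly, the same ordering can be expressed as a systemd drop-in. This is a sketch: the path below (/etc/systemd/system/slurmctld.service.d/override.conf) and the use of network-online.target are standard systemd conventions, not something confirmed in this thread.

```ini
# /etc/systemd/system/slurmctld.service.d/override.conf
# Delay slurmctld until munge and a configured (online) network are up.
# network-online.target is a stronger guarantee than network.target and
# makes it more likely that name resolution is ready when slurmctld starts.
[Unit]
After=network-online.target munge.service
Wants=network-online.target
```

After adding the drop-in, ‘systemctl daemon-reload’ picks it up on the next start.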
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div style="border-color:rgb(225,225,225) currentcolor currentcolor;border-style:solid none none;border-width:1pt medium medium;padding:3pt 0in 0in">
<p class="MsoNormal"><b>From:</b> slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>>
<b>On Behalf Of </b>Alpha Experiment<br>
<b>Sent:</b> Monday, December 14, 2020 4:20 PM<br>
<b>To:</b> <a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a><br>
<b>Subject:</b> [slurm-users] slurmctld daemon error<u></u><u></u></p>
</div>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal">Hi, <u></u><u></u></p>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">I am trying to run slurm on Fedora 33. Upon boot, the slurmd daemon is running correctly; however, the slurmctld daemon always fails.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New"">[admin@localhost ~]$ systemctl status slurmd.service
<br>
● slurmd.service - Slurm node daemon<br>
Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)<br>
Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago<br>
Main PID: 2363 (slurmd)<br>
Tasks: 2<br>
Memory: 3.4M<br>
CPU: 211ms<br>
CGroup: /system.slice/slurmd.service<br>
└─2363 /usr/local/sbin/slurmd -D<br>
Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.</span><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New"">[admin@localhost ~]$ systemctl status slurmctld.service
<br>
● slurmctld.service - Slurm controller daemon<br>
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)<br>
Drop-In: /etc/systemd/system/slurmctld.service.d<br>
└─override.conf<br>
Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago<br>
Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)<br>
Main PID: 1972 (code=exited, status=1/FAILURE)<br>
CPU: 21ms<br>
Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.<br>
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE<br>
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.</span><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">The slurmctld log is as follows:<u></u><u></u></p>
</div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New"">[2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster<br>
[2020-12-14T16:02:12.739] No memory enforcing mechanism configured.<br>
[2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known<br>
[2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"<br>
[2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported<br>
[2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost<br>
[2020-12-14T16:02:12.772] Recovered state of 1 nodes<br>
[2020-12-14T16:02:12.772] Recovered information about 0 jobs<br>
[2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions<br>
[2020-12-14T16:02:12.779] Recovered state of 0 reservations<br>
[2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified<br>
[2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure<br>
[2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions<br>
[2020-12-14T16:02:12.779] Running as primary controller<br>
[2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set<br>
[2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.<br>
[2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known<br>
[2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"<br>
[2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family<br>
[2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol</span><u></u><u></u></p>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New"">[2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol </span><u></u><u></u></p>
</div>
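The repeated "Name or service not known" lines come from getaddrinfo() failing, which suggests name resolution isn't up yet at the point in boot when slurmctld starts. A minimal Python sketch of the same lookup (the port 6817 default is Slurm's usual slurmctld port; the helper name is illustrative, not from Slurm's code):

```python
import socket

def can_resolve(host, port=6817):
    """Mimic slurmctld's startup lookup: getaddrinfo() on the controller name.

    Returns True if the name resolves, or False on the same 'Name or
    service not known' failure seen in the log above.
    """
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        return True
    except socket.gaierror:
        return False
```

Once the system is fully booted, can_resolve("localhost") should return True; at the moment slurmctld failed during boot, the same call would have returned False.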
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Strangely, the daemon works fine when it is restarted. After running<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New"">systemctl restart slurmctld.service</span><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">the service status is<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New"">[admin@localhost ~]$ systemctl status slurmctld.service </span><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New"">● slurmctld.service - Slurm controller daemon<br>
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)<br>
Drop-In: /etc/systemd/system/slurmctld.service.d<br>
└─override.conf<br>
Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago<br>
Main PID: 2815 (slurmctld)<br>
Tasks: 7<br>
Memory: 1.9M<br>
CPU: 15ms<br>
CGroup: /system.slice/slurmctld.service<br>
└─2815 /usr/local/sbin/slurmctld -D<br>
Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.</span><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Could anyone point me towards how to fix this? I expect it's just an issue with my configuration file, which I've copied below for reference.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New""># slurm.conf file generated by configurator easy.html.<br>
# Put this file on all nodes of your cluster.<br>
# See the slurm.conf man page for more information.<br>
#<br>
#SlurmctldHost=localhost<br>
ControlMachine=localhost<br>
#<br>
#MailProg=/bin/mail<br>
MpiDefault=none<br>
#MpiParams=ports=#-#<br>
ProctrackType=proctrack/cgroup<br>
ReturnToService=1<br>
SlurmctldPidFile=/home/slurm/run/slurmctld.pid<br>
#SlurmctldPort=6817<br>
SlurmdPidFile=/home/slurm/run/slurmd.pid<br>
#SlurmdPort=6818<br>
SlurmdSpoolDir=/var/spool/slurm/slurmd/<br>
SlurmUser=slurm<br>
#SlurmdUser=root<br>
StateSaveLocation=/home/slurm/spool/<br>
SwitchType=switch/none<br>
TaskPlugin=task/affinity<br>
#<br>
#<br>
# TIMERS<br>
#KillWait=30<br>
#MinJobAge=300<br>
#SlurmctldTimeout=120<br>
#SlurmdTimeout=300<br>
#<br>
#<br>
# SCHEDULING<br>
SchedulerType=sched/backfill<br>
SelectType=select/cons_tres<br>
SelectTypeParameters=CR_Core<br>
#<br>
#<br>
# LOGGING AND ACCOUNTING<br>
AccountingStorageType=accounting_storage/none<br>
ClusterName=cluster<br>
#JobAcctGatherFrequency=30<br>
JobAcctGatherType=jobacct_gather/none<br>
#SlurmctldDebug=info<br>
SlurmctldLogFile=/home/slurm/log/slurmctld.log<br>
#SlurmdDebug=info<br>
#SlurmdLogFile=<br>
#<br>
#<br>
# COMPUTE NODES<br>
NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN<br>
PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP</span><u></u><u></u></p>
</div>
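Since ControlMachine=localhost is the name the log shows failing to resolve, one generic diagnostic (an assumption about the cause, not something confirmed in this thread) is to check that the name resolves from /etc/hosts alone, without any network or DNS:

```shell
# getent consults /etc/nsswitch.conf (normally "files" first), so this
# shows whether "localhost" resolves even before the network is up.
getent hosts localhost
```

If this prints nothing, adding the usual "127.0.0.1 localhost localhost.localdomain" line to /etc/hosts would be the first thing to try.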
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Thanks!<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">-John<u></u><u></u></p>
</div>
</div>
</div>
</div>
</div>
</blockquote></div>