<div dir="ltr"><div>Maybe a silly question, but where do you find the daemon logs, or specify their location? <br></div><div><br></div><div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">~Avery Grieve</div><div>They/Them/Theirs please!<br></div><div dir="ltr"><div>University of Michigan</div></div></div></div></div></div></div></div></div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Dec 14, 2020 at 7:22 PM Alpha Experiment <<a href="mailto:projectalpha137@gmail.com">projectalpha137@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>I am trying to run Slurm on Fedora 33. Upon boot the slurmd daemon runs correctly; however, the slurmctld daemon always fails.</div><div><font size="1" face="monospace">[admin@localhost ~]$ systemctl status slurmd.service <br>● slurmd.service - Slurm node daemon<br> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)<br> Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago<br> Main PID: 2363 (slurmd)<br> Tasks: 2<br> Memory: 3.4M<br> CPU: 211ms<br> CGroup: /system.slice/slurmd.service<br> └─2363 /usr/local/sbin/slurmd -D<br>Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.</font></div><div><font size="1" face="monospace">[admin@localhost ~]$ systemctl status slurmctld.service <br>● slurmctld.service - Slurm controller daemon<br> Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)<br> Drop-In: /etc/systemd/system/slurmctld.service.d<br> └─override.conf<br> Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago<br> Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, 
status=1/FAILURE)<br> Main PID: 1972 (code=exited, status=1/FAILURE)<br> CPU: 21ms<br>Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.<br>Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE<br>Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.</font><br></div><div><br></div><div>The slurmctld log is as follows:</div><font size="1" face="monospace">[2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster<br>[2020-12-14T16:02:12.739] No memory enforcing mechanism configured.<br>[2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known<br>[2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"<br>[2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported<br>[2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost<br>[2020-12-14T16:02:12.772] Recovered state of 1 nodes<br>[2020-12-14T16:02:12.772] Recovered information about 0 jobs<br>[2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions<br>[2020-12-14T16:02:12.779] Recovered state of 0 reservations<br>[2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified<br>[2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure<br>[2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions<br>[2020-12-14T16:02:12.779] Running as primary controller<br>[2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set<br>[2020-12-14T16:02:12.780] mcs: MCSParameters = (null). 
ondemand set.<br>[2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known<br>[2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"<br>[2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family<br>[2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol<br></font><div><font size="1" face="monospace">[2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol </font></div><div><br></div><div>Strangely, the daemon works fine when the service is restarted. After running</div><div><font size="1" face="monospace">systemctl restart slurmctld.service</font><br></div><div><br></div><div>the service status is</div><div><font size="1" face="monospace">[admin@localhost ~]$ systemctl status slurmctld.service <br></font></div><div><font size="1" face="monospace">● slurmctld.service - Slurm controller daemon<br> Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)<br> Drop-In: /etc/systemd/system/slurmctld.service.d<br> └─override.conf<br> Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago<br> Main PID: 2815 (slurmctld)<br> Tasks: 7<br> Memory: 1.9M<br> CPU: 15ms<br> CGroup: /system.slice/slurmctld.service<br> └─2815 /usr/local/sbin/slurmctld -D<br>Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.</font><br></div><div><br></div><div>Could anyone point me towards how to fix this? 
I expect it's just an issue with my configuration file, which I've copied below for reference.</div><div><font size="1" face="monospace"># slurm.conf file generated by configurator easy.html.<br># Put this file on all nodes of your cluster.<br># See the slurm.conf man page for more information.<br>#<br>#SlurmctldHost=localhost<br>ControlMachine=localhost<br>#<br>#MailProg=/bin/mail<br>MpiDefault=none<br>#MpiParams=ports=#-#<br>ProctrackType=proctrack/cgroup<br>ReturnToService=1<br>SlurmctldPidFile=/home/slurm/run/slurmctld.pid<br>#SlurmctldPort=6817<br>SlurmdPidFile=/home/slurm/run/slurmd.pid<br>#SlurmdPort=6818<br>SlurmdSpoolDir=/var/spool/slurm/slurmd/<br>SlurmUser=slurm<br>#SlurmdUser=root<br>StateSaveLocation=/home/slurm/spool/<br>SwitchType=switch/none<br>TaskPlugin=task/affinity<br>#<br>#<br># TIMERS<br>#KillWait=30<br>#MinJobAge=300<br>#SlurmctldTimeout=120<br>#SlurmdTimeout=300<br>#<br>#<br># SCHEDULING<br>SchedulerType=sched/backfill<br>SelectType=select/cons_tres<br>SelectTypeParameters=CR_Core<br>#<br>#<br># LOGGING AND ACCOUNTING<br>AccountingStorageType=accounting_storage/none<br>ClusterName=cluster<br>#JobAcctGatherFrequency=30<br>JobAcctGatherType=jobacct_gather/none<br>#SlurmctldDebug=info<br>SlurmctldLogFile=/home/slurm/log/slurmctld.log<br>#SlurmdDebug=info<br>#SlurmdLogFile=<br>#<br>#<br># COMPUTE NODES<br>NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN<br>PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP</font><br></div><div><br></div><div>Thanks!</div><div>-John</div></div>
</blockquote></div>
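On the boot-time failure in the quoted message: failing at boot but starting cleanly on a manual restart, together with the getaddrinfo()/"Unable to resolve" errors, is the usual signature of slurmctld starting before hostname resolution is available. One possible fix is a systemd ordering drop-in; this is a sketch based on that assumption, not a confirmed diagnosis (the contents of the existing override.conf shown in the status output are not visible in the thread):

```ini
# /etc/systemd/system/slurmctld.service.d/override.conf
# Hypothetical drop-in: delay slurmctld until the network (and with it,
# name resolution) is reported online.
[Unit]
Wants=network-online.target
After=network-online.target
```

After adding or editing the drop-in, `systemctl daemon-reload` makes systemd pick it up before the next boot.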
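On the log-location question at the top of the thread: Slurm's daemon log paths are whatever slurm.conf sets via SlurmctldLogFile and SlurmdLogFile (when SlurmdLogFile is left unset, slurmd logs via syslog). A small self-contained sketch of checking them, using the two lines copied from the config quoted in this thread (the temp file stands in for the real slurm.conf, whose install path is not shown here):

```shell
# Sketch: read the log-file settings back out of a slurm.conf-style file.
# The two lines below are copied from the config quoted in this thread.
conf=$(mktemp)
printf '%s\n' \
  'SlurmctldLogFile=/home/slurm/log/slurmctld.log' \
  '#SlurmdLogFile=' > "$conf"

# Show only the active (non-commented) log settings:
grep -i '^slurm.*logfile' "$conf"

# On a node with a running controller you could instead ask Slurm directly:
#   scontrol show config | grep -i logfile
rm -f "$conf"
```

The same grep against the real config file answers "where are the logs" without a running controller.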