Hi,
I'm trying to set up a Slurm (version 22.05.8) cluster consisting of 3 nodes with these hostnames and local IP addresses:
server1 - 10.36.17.152
server2 - 10.36.17.166
server3 - 10.36.17.132
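For reference, an /etc/hosts mapping consistent with these names and addresses would look like the sketch below; I'm only including it for context, since name resolution might equally come from DNS on these machines:

10.36.17.152 server1
10.36.17.166 server2
10.36.17.132 server3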
I had scrambled together a minimum working example using these resources: https://github.com/SergioMEV/slurm-for-dummies https://blog.devops.dev/slurm-complete-guide-a-to-z-concepts-setup-and-troub...
For a while everything looked fine, and I was able to run the command I usually use to check that the cluster is healthy:
srun --label --nodes=3 hostname
This used to show the expected output: the hostnames of all 3 computers, namely server1, server2, and server3.
However, despite my having made no changes to the configs, the command no longer works if I specify more than 1 node. This behaviour is consistent on all 3 computers. The output of 'sinfo' is also included below:
root@server1:~# srun --nodes=1 hostname
server1
root@server1:~#
root@server1:~# srun --nodes=3 hostname
srun: Required node not available (down, drained or reserved)
srun: job 312 queued and waiting for resources
^Csrun: Job allocation 312 has been revoked
srun: Force Terminated JobId=312
root@server1:~#
root@server1:~# ssh server2 "srun --nodes=1 hostname"
server1
root@server1:~#
root@server1:~# ssh server2 "srun --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 314 queued and waiting for resources
^C
root@server1:~#
root@server1:~# sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
mainPartition*    up   infinite      2  down* server[2-3]
mainPartition*    up   infinite      1   idle server1
root@server1:~#
It turns out that slurmctld on the master node (hostname: server1) and slurmd on the slave nodes (hostnames: server2 & server3) are throwing errors that are probably related to networking. Here are a few lines before and after the first occurrence of the error in slurmctld.log on the master node; it is the only type of error I have noticed in the logs (pastebin of the entire log: https://pastebin.com/GBSWXZJR):
root@server1:/var/log# grep -B 20 -A 5 -m1 -i "error" slurmctld.log
[2024-07-26T13:13:49.579] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-07-26T13:13:49.580] debug: power_save module disabled, SuspendTime < 0
[2024-07-26T13:13:49.580] Running as primary controller
[2024-07-26T13:13:49.580] debug: No backup controllers, not launching heartbeat.
[2024-07-26T13:13:49.580] debug: priority/basic: init: Priority BASIC plugin loaded
[2024-07-26T13:13:49.580] No parameter for mcs plugin, default values set
[2024-07-26T13:13:49.580] mcs: MCSParameters = (null). ondemand set.
[2024-07-26T13:13:49.580] debug: mcs/none: init: mcs none plugin loaded
[2024-07-26T13:13:49.580] debug2: slurmctld listening on 0.0.0.0:6817
[2024-07-26T13:13:52.662] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
[2024-07-26T13:13:52.662] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
[2024-07-26T13:13:52.662] debug: gres/gpu: init: loaded
[2024-07-26T13:13:52.662] debug: validate_node_specs: node server1 registered with 0 jobs
[2024-07-26T13:13:52.662] debug2: _slurm_rpc_node_registration complete for server1 usec=229
[2024-07-26T13:13:53.586] debug: Spawning registration agent for server[2-3] 2 hosts
[2024-07-26T13:13:53.586] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2024-07-26T13:13:53.586] debug: sched: Running job scheduler for default depth.
[2024-07-26T13:13:53.586] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
[2024-07-26T13:13:53.587] debug2: Tree head got back 0 looking for 2
[2024-07-26T13:13:53.588] debug2: _slurm_connect: failed to connect to 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: Error connecting slurm stream socket at 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: _slurm_connect: failed to connect to 10.36.17.132:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: Error connecting slurm stream socket at 10.36.17.132:6818: Connection refused
[2024-07-26T13:13:54.588] debug2: _slurm_connect: failed to connect to 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:54.588] debug2: Error connecting slurm stream socket at 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:54.589] debug2: _slurm_connect: failed to connect to 10.36.17.132:6818: Connection refused
The connections to 10.36.17.166:6818 and 10.36.17.132:6818 are refused. That is the port specified by the 'SlurmdPort' key in slurm.conf.
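A quick way to confirm from server1 that nothing is listening on that port on the other two nodes would be something like the following (just an ad-hoc check of my own, assuming netcat is installed):

# From server1: probe SlurmdPort (6818) on the compute nodes
nc -vz 10.36.17.166 6818
nc -vz 10.36.17.132 6818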
There are similar errors in the slurmd.log files on both slave nodes as well. In slurmd.log on server2, the error appears only at the end of the file (pastebin of the entire log: https://pastebin.com/TwSMiAp7):
root@server2:/var/log# tail -5 slurmd.log
[2024-07-26T13:13:53.018] debug: mpi/pmix_v4: init: PMIx plugin loaded
[2024-07-26T13:13:53.018] debug: mpi/pmix_v4: init: PMIx plugin loaded
[2024-07-26T13:13:53.018] debug2: No mpi.conf file (/etc/slurm/mpi.conf)
[2024-07-26T13:13:53.018] error: Error binding slurm stream socket: Address already in use
[2024-07-26T13:13:53.018] error: Unable to bind listen port (6818): Address already in use
slurmd.log on server3 (pastebin of the entire log: https://pastebin.com/K55cAGLb):
root@server3:/var/log# tail -5 slurmd.log
[2024-07-26T13:13:53.383] debug: mpi/pmix_v4: init: PMIx plugin loaded
[2024-07-26T13:13:53.383] debug: mpi/pmix_v4: init: PMIx plugin loaded
[2024-07-26T13:13:53.383] debug2: No mpi.conf file (/etc/slurm/mpi.conf)
[2024-07-26T13:13:53.384] error: Error binding slurm stream socket: Address already in use
[2024-07-26T13:13:53.384] error: Unable to bind listen port (6818): Address already in use
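To find out what is already holding port 6818 on the compute nodes, something like this could be run on server2 and server3 (again only an ad-hoc check, assuming ss and pgrep are available there):

# On server2 / server3: show whatever is bound to the SlurmdPort, and any slurmd processes still running
ss -tlnp | grep ':6818'
pgrep -a slurmd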
I use this script to restart Slurm whenever I change any of the configs. Could the order in which these operations are done be causing the problems I'm facing?
#! /bin/bash
scp /etc/slurm/slurm.conf /etc/slurm/gres.conf server2:/etc/slurm/ && echo copied slurm.conf and gres.conf to server2;
scp /etc/slurm/slurm.conf /etc/slurm/gres.conf server3:/etc/slurm/ && echo copied slurm.conf and gres.conf to server3;
echo
echo restarting slurmctld and slurmd on server1
(scontrol shutdown ; sleep 3 ; rm -f /var/log/slurmd.log /var/log/slurmctld.log ; slurmctld -d ; sleep 3 ; slurmd) && echo done
echo restarting slurmd on server2
(ssh server2 "rm -f /var/log/slurmd.log /var/log/slurmctld.log ; slurmd") && echo done
echo restarting slurmd on server3
(ssh server3 "rm -f /var/log/slurmd.log /var/log/slurmctld.log ; slurmd") && echo done
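If those "Address already in use" errors mean an old slurmd is still bound to 6818 on the compute nodes, then one variant I could try is to explicitly stop slurmd everywhere before starting it again. This is only a sketch of a different ordering, not something taken from the guides above:

#! /bin/bash
# Sketch: stop every daemon first, then start them, so a stale slurmd cannot keep port 6818 bound
scp /etc/slurm/slurm.conf /etc/slurm/gres.conf server2:/etc/slurm/
scp /etc/slurm/slurm.conf /etc/slurm/gres.conf server3:/etc/slurm/
scontrol shutdown ; sleep 3                                    # ask slurmctld (and the slurmd daemons it knows about) to shut down
ssh server2 "pkill -x slurmd" ; ssh server3 "pkill -x slurmd"  # catch any slurmd that did not get the shutdown message
sleep 3
slurmctld && slurmd && echo restarted server1
ssh server2 "slurmd" && echo restarted server2
ssh server3 "slurmd" && echo restarted server3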
Config files: slurm.conf without the comments:
root@server1:/etc/slurm# grep -v "#" slurm.conf
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
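As a related check (not something I took from the guides), slurmd -C should print the hardware configuration the daemon detects on each machine, which can be compared against the NodeName line above:

# On each node: print detected hardware (CPUs, sockets, RealMemory) to compare with slurm.conf
slurmd -C
ssh server2 "slurmd -C"
ssh server3 "slurmd -C"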
gres.conf:
root@server1:/etc/slurm# cat gres.conf
NodeName=server1 Name=gpu File=/dev/nvidia0
NodeName=server2 Name=gpu File=/dev/nvidia0
NodeName=server3 Name=gpu File=/dev/nvidia0
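And a trivial sanity check, included for completeness, to confirm the GPU device file referenced above exists on every node:

# Check that /dev/nvidia0 is present on all 3 nodes, since gres.conf points at it
ls -l /dev/nvidia0
ssh server2 "ls -l /dev/nvidia0"
ssh server3 "ls -l /dev/nvidia0"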
These config files are the same on all 3 computers.
As a complete beginner to Linux and Slurm administration, I have been struggling to understand even the most basic documentation, and I have been unable to find answers online. Any assistance would be greatly appreciated.
Thanks!
This solved my problem: https://www.reddit.com/r/HPC/comments/1eb3f0g/comment/lfmed27/?utm_source=sh...