Hi,
I am trying to set up Slurm with GPUs as GRES on a 3 node configuration (hostnames: server1, server2, server3).
For a while everything looked fine and I was able to run
srun --label --nodes=3 hostname
which is what I use to test whether Slurm is working correctly. Then, with no apparent cause, it stopped working.
It turns out slurmctld is not running and throws the following error (the two lines are consecutive in the log file):
root@server1:/var/log# grep -i error slurmctld.log
[2024-07-22T14:47:32.302] error: Error binding slurm stream socket: Address already in use
[2024-07-22T14:47:32.302] fatal: slurm_init_msg_engine_port error Address already in use
This error appeared even though I had made no changes to the config files; in fact, the cluster wasn't used at all for a few weeks before it showed up.
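In case it helps narrow things down, these are the checks I was planning to run on server1 to see what is already holding the port (just a sketch, I have not captured their output here):

# Which process is listening on the slurmctld port (6817 in my slurm.conf)?
ss -tlnp | grep 6817
# Is a stale slurmctld process still alive?
pgrep -a slurmctld
# What does systemd think the service state is?
systemctl status slurmctld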
This is the simple script I use to restart Slurm:
root@server1:~# cat slurmRestart.sh
#! /bin/bash
scp /etc/slurm/slurm.conf server2:/etc/slurm/ && echo copied slurm.conf to server2;
scp /etc/slurm/slurm.conf server3:/etc/slurm/ && echo copied slurm.conf to server3;
rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld ; echo restarting slurm on server1;
(ssh server2 "rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld") && echo restarting slurm on server2;
(ssh server3 "rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld") && echo restarting slurm on server3;
Could the error be due to slurmd and/or slurmctld not being started in the right order?
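If the order does matter, the more explicit restart sequence I was considering is to stop everything first, confirm the port is free, and only then start slurmctld on server1 followed by slurmd on every node (a sketch of the idea, not something I have tested yet):

# Stop both daemons everywhere so nothing keeps port 6817 bound
systemctl stop slurmd slurmctld
ssh server2 "systemctl stop slurmd slurmctld"
ssh server3 "systemctl stop slurmd slurmctld"
# Verify the controller port is actually free before starting anything
ss -tlnp | grep 6817 || echo "port 6817 is free"
# Start the controller first, then slurmd on all nodes
systemctl start slurmctld
systemctl start slurmd
ssh server2 "systemctl start slurmd"
ssh server3 "systemctl start slurmd"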
The other question I have is about configuring a GPU as a GRES: how do I verify that it has been configured correctly? I was told to run srun nvidia-smi with and without requesting the GPU, but whether or not I request the GPU has no effect on the output of the command:
root@server1:~# srun --nodes=1 nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-55f127a8-dbf4-fd12-3cad-c0d5f2dcb005
root@server1:~#
root@server1:~# srun --nodes=1 --gpus-per-node=1 nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-55f127a8-dbf4-fd12-3cad-c0d5f2dcb005
I am sceptical about whether the GPU has been configured properly. Is this the best way to check that it has?
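From what I understand, nvidia-smi ignores CUDA_VISIBLE_DEVICES, so its output may look the same either way. The check I was planning to try instead (a sketch, not yet run) is to see whether Slurm sets CUDA_VISIBLE_DEVICES only when a GPU is requested:

# With a GPU requested, the gres/gpu plugin should export CUDA_VISIBLE_DEVICES
srun --nodes=1 --gpus-per-node=1 bash -c 'echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'
# Without requesting a GPU, the variable should stay unset
srun --nodes=1 bash -c 'echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'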
*The error:* I first noticed the problem when I ran the command I usually use to check that everything is fine: srun now works only when I request a single node, and if I request 3 nodes the job just sits in the queue and the only way to stop it is to press Ctrl+C:
root@server1:~# srun --label --nodes=1 hostname
0: server1
root@server1:~# ssh server2 "srun --label --nodes=1 hostname"
0: server1
root@server1:~# ssh server3 "srun --label --nodes=1 hostname"
0: server1
root@server1:~# srun --label --nodes=3 hostname
srun: Required node not available (down, drained or reserved)
srun: job 265 queued and waiting for resources
^C
srun: Job allocation 265 has been revoked
srun: Force Terminated JobId=265
root@server1:~# ssh server2 "srun --label --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 266 queued and waiting for resources
^C
root@server1:~# ssh server3 "srun --label --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 267 queued and waiting for resources
root@server1:~#
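Once slurmctld is back up, these are the commands I intend to run to see why the other nodes are reported as down, drained or reserved (a sketch; I could not run them while the controller was dead):

# Show the state of every node in the partition
sinfo -N -l
# Show the exact state and any Reason= recorded for a node
scontrol show node server2
# If the nodes are merely marked DOWN, try resuming them
scontrol update NodeName=server[2-3] State=RESUME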
*The logs:* 1) The last 30 lines of */var/log/slurmctld.log* at the debug5 level on server #1 (pastebin of the entire log: https://pastebin.com/fw4C4xtr):
root@server1:/var/log# tail -30 slurmctld.log
[2024-07-22T14:47:32.301] debug: Updating partition uid access list
[2024-07-22T14:47:32.301] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/resv_state` as buf_t
[2024-07-22T14:47:32.301] debug3: Version string in resv_state header is PROTOCOL_VERSION
[2024-07-22T14:47:32.301] Recovered state of 0 reservations
[2024-07-22T14:47:32.301] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/trigger_state` as buf_t
[2024-07-22T14:47:32.301] State of 0 triggers recovered
[2024-07-22T14:47:32.301] read_slurm_conf: backup_controller not specified
[2024-07-22T14:47:32.301] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-07-22T14:47:32.301] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-07-22T14:47:32.301] debug: power_save module disabled, SuspendTime < 0
[2024-07-22T14:47:32.301] Running as primary controller
[2024-07-22T14:47:32.301] debug: No backup controllers, not launching heartbeat.
[2024-07-22T14:47:32.301] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/priority_basic.so
[2024-07-22T14:47:32.301] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Priority BASIC plugin type:priority/basic version:0x160508
[2024-07-22T14:47:32.301] debug: priority/basic: init: Priority BASIC plugin loaded
[2024-07-22T14:47:32.301] debug3: Success.
[2024-07-22T14:47:32.301] No parameter for mcs plugin, default values set
[2024-07-22T14:47:32.301] mcs: MCSParameters = (null). ondemand set.
[2024-07-22T14:47:32.301] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/mcs_none.so
[2024-07-22T14:47:32.301] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mcs none plugin type:mcs/none version:0x160508
[2024-07-22T14:47:32.301] debug: mcs/none: init: mcs none plugin loaded
[2024-07-22T14:47:32.301] debug3: Success.
[2024-07-22T14:47:32.302] debug3: _slurmctld_rpc_mgr pid = 3159324
[2024-07-22T14:47:32.302] debug3: _slurmctld_background pid = 3159324
[2024-07-22T14:47:32.302] error: Error binding slurm stream socket: Address already in use
[2024-07-22T14:47:32.302] fatal: slurm_init_msg_engine_port error Address already in use
[2024-07-22T14:47:32.304] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.304] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.304] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.304] slurmscriptd: debug: _slurmscriptd_mainloop: finished
2) Entirety of *slurmctld.log on server #2*:
root@server2:/var/log# cat slurmctld.log
[2024-07-22T14:47:32.614] debug: slurmctld log levels: stderr=debug5 logfile=debug5 syslog=quiet
[2024-07-22T14:47:32.614] debug: Log file re-opened
[2024-07-22T14:47:32.615] slurmscriptd: debug: slurmscriptd: Got ack from slurmctld, initialization successful
[2024-07-22T14:47:32.615] slurmscriptd: debug: _slurmscriptd_mainloop: started
[2024-07-22T14:47:32.616] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.616] debug: slurmctld: slurmscriptd fork()'d and initialized.
[2024-07-22T14:47:32.616] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.616] debug: _slurmctld_listener_thread: started listening to slurmscriptd
[2024-07-22T14:47:32.616] debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.616] debug3: Called _msg_readable
[2024-07-22T14:47:32.616] slurmctld version 22.05.8 started on cluster dlabcluster
[2024-07-22T14:47:32.616] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
[2024-07-22T14:47:32.616] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x160508
[2024-07-22T14:47:32.616] cred/munge: init: Munge credential signature plugin loaded
[2024-07-22T14:47:32.616] debug3: Success.
[2024-07-22T14:47:32.616] error: This host (server2/server2) not a valid controller
[2024-07-22T14:47:32.617] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.617] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.617] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.617] slurmscriptd: debug: _slurmscriptd_mainloop: finished
3) Entirety of *slurmctld.log on server #3*:
root@server3:/var/log# cat slurmctld.log
[2024-07-22T14:47:32.927] debug: slurmctld log levels: stderr=debug5 logfile=debug5 syslog=quiet
[2024-07-22T14:47:32.927] debug: Log file re-opened
[2024-07-22T14:47:32.928] slurmscriptd: debug: slurmscriptd: Got ack from slurmctld, initialization successful
[2024-07-22T14:47:32.928] slurmscriptd: debug: _slurmscriptd_mainloop: started
[2024-07-22T14:47:32.928] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.928] debug: slurmctld: slurmscriptd fork()'d and initialized.
[2024-07-22T14:47:32.928] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.928] slurmctld version 22.05.8 started on cluster dlabcluster
[2024-07-22T14:47:32.929] debug: _slurmctld_listener_thread: started listening to slurmscriptd
[2024-07-22T14:47:32.929] debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.929] debug3: Called _msg_readable
[2024-07-22T14:47:32.929] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
[2024-07-22T14:47:32.929] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x160508
[2024-07-22T14:47:32.929] cred/munge: init: Munge credential signature plugin loaded
[2024-07-22T14:47:32.929] debug3: Success.
[2024-07-22T14:47:32.929] error: This host (server3/server3) not a valid controller
[2024-07-22T14:47:32.930] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.930] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.930] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.930] slurmscriptd: debug: _slurmscriptd_mainloop: finished
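Since slurm.conf names only server1 as SlurmctldHost, I am guessing these "not a valid controller" errors just mean my restart script should not be starting slurmctld on server2 and server3 at all, and that only slurmd belongs there. Something like this is what I had in mind (a sketch):

# slurmctld only runs on the controller (server1); the other nodes only need slurmd
ssh server2 "systemctl disable --now slurmctld && systemctl restart slurmd"
ssh server3 "systemctl disable --now slurmctld && systemctl restart slurmd"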
*The config files (shared by all 3 computers):* 1) */etc/slurm/slurm.conf* without the comments:
root@server1:/etc/slurm# grep -v "#" slurm.conf
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
2) */etc/slurm/gres.conf*:
root@server1:/etc/slurm# cat gres.conf
NodeName=server1 Name=gpu File=/dev/nvidia0
NodeName=server2 Name=gpu File=/dev/nvidia0
NodeName=server3 Name=gpu File=/dev/nvidia0
These files are the same on all 3 computers:
root@server1:/etc/slurm# diff slurm.conf <(ssh server2 "cat /etc/slurm/slurm.conf")
root@server1:/etc/slurm# diff slurm.conf <(ssh server3 "cat /etc/slurm/slurm.conf")
root@server1:/etc/slurm# diff gres.conf <(ssh server2 "cat /etc/slurm/gres.conf")
root@server1:/etc/slurm# diff gres.conf <(ssh server3 "cat /etc/slurm/gres.conf")
root@server1:/etc/slurm#
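To double-check that the NodeName and Gres lines above match reality, these are the verification commands I was planning to run once the controller starts again (a sketch):

# On each node: print the hardware slurmd detects; it should roughly match the
# Sockets/CoresPerSocket/ThreadsPerCore/RealMemory values in slurm.conf
slurmd -C
# On the controller: confirm the GPU GRES was registered for every node
scontrol show node server1 | grep -i gres
sinfo -o "%N %G"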
Thank you, Shookti