Hi,
I am trying to set up Slurm with GPUs as GRES on a 3-node configuration (hostnames: server1, server2, server3).
For a while everything looked fine and I was able to run
srun --label --nodes=3 hostname
which is what I use to test whether Slurm is working correctly, but at some point it just stopped working.
It turns out slurmctld is not running and throws the following error (the two lines are consecutive in the log file):
root@server1:/var/log# grep -i error slurmctld.log
[2024-07-22T14:47:32.302] error: Error binding slurm stream socket: Address already in use
[2024-07-22T14:47:32.302] fatal: slurm_init_msg_engine_port error Address already in use
This error appeared even though I made no changes to the config files; in fact, the cluster wasn't used at all for a few weeks before the error first showed up.
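My current guess is that something (perhaps a stale slurmctld) is still holding the controller port, so the new instance cannot bind to it. This is roughly what I plan to run on server1 to check that; port 6817 is the SlurmctldPort from my slurm.conf below, and I haven't confirmed this is the right diagnosis:
ss -tlnp | grep ':6817'             # is anything already listening on the slurmctld port?
pgrep -a slurmctld                  # is an old slurmctld process still around?
systemctl status slurmctld slurmd   # does systemd think the daemons are already active?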
This is the simple script I use to restart Slurm:
root@server1:~# cat slurmRestart.sh
#! /bin/bash
scp /etc/slurm/slurm.conf server2:/etc/slurm/ && echo copied slurm.conf to server2;
scp /etc/slurm/slurm.conf server3:/etc/slurm/ && echo copied slurm.conf to server3;
rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld ; echo restarting slurm on server1;
(ssh server2 "rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld") && echo restarting slurm on server2;
(ssh server3 "rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld") && echo restarting slurm on server3;
Could the error be due to slurmd and/or slurmctld not being started in the right order?
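In case the order does matter, this is the restart sequence I was thinking of switching to (stop everything, start slurmctld only on server1 since it is the only SlurmctldHost, then start slurmd everywhere). This is just a sketch I have not tested yet:
# stop the daemons everywhere first (run on server1)
systemctl stop slurmctld slurmd
ssh server2 "systemctl stop slurmd"
ssh server3 "systemctl stop slurmd"
# start the controller first, then the compute daemons
systemctl start slurmctld
systemctl start slurmd
ssh server2 "systemctl start slurmd"
ssh server3 "systemctl start slurmd"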
The other question I have is about the configuration of a GPU as a GRES: how do I verify that it has been configured correctly? I was told to run nvidia-smi through srun with and without requesting a GPU, but whether or not I request a GPU has no effect on the output of the command:
root@server1:~# srun --nodes=1 nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-55f127a8-dbf4-fd12-3cad-c0d5f2dcb005
root@server1:~#
root@server1:~# srun --nodes=1 --gpus-per-node=1 nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-55f127a8-dbf4-fd12-3cad-c0d5f2dcb005
I am sceptical about whether the GPU has been configured properly; is this the best way to check that it has?
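The only other checks I could think of are asking the controller what it believes the node has, and looking at which devices a job step is actually handed. I am assuming CUDA_VISIBLE_DEVICES is the right thing to look at here:
scontrol show node server1 | grep -i gres                                        # does the controller report Gres=gpu:1?
srun --nodes=1 bash -c 'echo ${CUDA_VISIBLE_DEVICES:-unset}'                     # without requesting a GPU
srun --nodes=1 --gpus-per-node=1 bash -c 'echo ${CUDA_VISIBLE_DEVICES:-unset}'   # with a GPU requested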
The error:
I first noticed this when I ran the command I usually use to check that everything is fine: srun only works when I ask for one node, and if I specify 3 nodes the only way to stop it is to press Ctrl+C:
root@server1:~# srun --label --nodes=1 hostname
0: server1
root@server1:~# ssh server2 "srun --label --nodes=1 hostname"
0: server1
root@server1:~# ssh server3 "srun --label --nodes=1 hostname"
0: server1
root@server1:~# srun --label --nodes=3 hostname
srun: Required node not available (down, drained or reserved)
srun: job 265 queued and waiting for resources
^Csrun: Job allocation 265 has been revoked
srun: Force Terminated JobId=265
root@server1:~# ssh server2 "srun --label --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 266 queued and waiting for resources
^Croot@server1:~# ssh server3 "srun --label --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 267 queued and waiting for resources
root@server1:~#
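Since srun says the nodes are down, drained or reserved, I also plan to check what state the controller reports for each node once it is back up, roughly like this:
sinfo -N -l                                             # per-node state, with the REASON column
scontrol show node server2 | grep -iE 'state|reason'    # why a particular node is down/drained
scontrol update nodename=server[1-3] state=resume       # bring nodes back if they are only DOWN/DRAINED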
The logs:
1) The last 30 lines of /var/log/slurmctld.log at the debug5 level on server #1 (pastebin to the entire log):
root@server1:/var/log# tail -30 slurmctld.log
[2024-07-22T14:47:32.301] debug: Updating partition uid access list
[2024-07-22T14:47:32.301] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/resv_state` as buf_t
[2024-07-22T14:47:32.301] debug3: Version string in resv_state header is PROTOCOL_VERSION
[2024-07-22T14:47:32.301] Recovered state of 0 reservations
[2024-07-22T14:47:32.301] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/trigger_state` as buf_t
[2024-07-22T14:47:32.301] State of 0 triggers recovered
[2024-07-22T14:47:32.301] read_slurm_conf: backup_controller not specified
[2024-07-22T14:47:32.301] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-07-22T14:47:32.301] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-07-22T14:47:32.301] debug: power_save module disabled, SuspendTime < 0
[2024-07-22T14:47:32.301] Running as primary controller
[2024-07-22T14:47:32.301] debug: No backup controllers, not launching heartbeat.
[2024-07-22T14:47:32.301] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/priority_basic.so
[2024-07-22T14:47:32.301] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Priority BASIC plugin type:priority/basic version:0x160508
[2024-07-22T14:47:32.301] debug: priority/basic: init: Priority BASIC plugin loaded
[2024-07-22T14:47:32.301] debug3: Success.
[2024-07-22T14:47:32.301] No parameter for mcs plugin, default values set
[2024-07-22T14:47:32.301] mcs: MCSParameters = (null). ondemand set.
[2024-07-22T14:47:32.301] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/mcs_none.so
[2024-07-22T14:47:32.301] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mcs none plugin type:mcs/none version:0x160508
[2024-07-22T14:47:32.301] debug: mcs/none: init: mcs none plugin loaded
[2024-07-22T14:47:32.301] debug3: Success.
[2024-07-22T14:47:32.302] debug3: _slurmctld_rpc_mgr pid = 3159324
[2024-07-22T14:47:32.302] debug3: _slurmctld_background pid = 3159324
[2024-07-22T14:47:32.302] error: Error binding slurm stream socket: Address already in use
[2024-07-22T14:47:32.302] fatal: slurm_init_msg_engine_port error Address already in use
[2024-07-22T14:47:32.304] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.304] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.304] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.304] slurmscriptd: debug: _slurmscriptd_mainloop: finished
2) Entirety of slurmctld.log on server #2:
root@server2:/var/log# cat slurmctld.log
[2024-07-22T14:47:32.614] debug: slurmctld log levels: stderr=debug5 logfile=debug5 syslog=quiet
[2024-07-22T14:47:32.614] debug: Log file re-opened
[2024-07-22T14:47:32.615] slurmscriptd: debug: slurmscriptd: Got ack from slurmctld, initialization successful
[2024-07-22T14:47:32.615] slurmscriptd: debug: _slurmscriptd_mainloop: started
[2024-07-22T14:47:32.616] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.616] debug: slurmctld: slurmscriptd fork()'d and initialized.
[2024-07-22T14:47:32.616] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.616] debug: _slurmctld_listener_thread: started listening to slurmscriptd
[2024-07-22T14:47:32.616] debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.616] debug3: Called _msg_readable
[2024-07-22T14:47:32.616] slurmctld version 22.05.8 started on cluster dlabcluster
[2024-07-22T14:47:32.616] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
[2024-07-22T14:47:32.616] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x160508
[2024-07-22T14:47:32.616] cred/munge: init: Munge credential signature plugin loaded
[2024-07-22T14:47:32.616] debug3: Success.
[2024-07-22T14:47:32.616] error: This host (server2/server2) not a valid controller
[2024-07-22T14:47:32.617] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.617] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.617] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.617] slurmscriptd: debug: _slurmscriptd_mainloop: finished
3) Entirety of slurmctld.log on server #3:
root@server3:/var/log# cat slurmctld.log
[2024-07-22T14:47:32.927] debug: slurmctld log levels: stderr=debug5 logfile=debug5 syslog=quiet
[2024-07-22T14:47:32.927] debug: Log file re-opened
[2024-07-22T14:47:32.928] slurmscriptd: debug: slurmscriptd: Got ack from slurmctld, initialization successful
[2024-07-22T14:47:32.928] slurmscriptd: debug: _slurmscriptd_mainloop: started
[2024-07-22T14:47:32.928] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.928] debug: slurmctld: slurmscriptd fork()'d and initialized.
[2024-07-22T14:47:32.928] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.928] slurmctld version 22.05.8 started on cluster dlabcluster
[2024-07-22T14:47:32.929] debug: _slurmctld_listener_thread: started listening to slurmscriptd
[2024-07-22T14:47:32.929] debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.929] debug3: Called _msg_readable
[2024-07-22T14:47:32.929] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
[2024-07-22T14:47:32.929] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x160508
[2024-07-22T14:47:32.929] cred/munge: init: Munge credential signature plugin loaded
[2024-07-22T14:47:32.929] debug3: Success.
[2024-07-22T14:47:32.929] error: This host (server3/server3) not a valid controller
[2024-07-22T14:47:32.930] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.930] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.930] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.930] slurmscriptd: debug: _slurmscriptd_mainloop: finished
The config files (shared by all 3 computers):
1) /etc/slurm/slurm.conf without the comments:
root@server1:/etc/slurm# grep -v "#" slurm.conf
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
2) /etc/slurm/gres.conf:
root@server1:/etc/slurm# cat gres.conf
NodeName=server1 Name=gpu File=/dev/nvidia0
NodeName=server2 Name=gpu File=/dev/nvidia0
NodeName=server3 Name=gpu File=/dev/nvidia0
These files are the same on all 3 computers:
root@server1:/etc/slurm# diff slurm.conf <(ssh server2 "cat /etc/slurm/slurm.conf")
root@server1:/etc/slurm# diff slurm.conf <(ssh server3 "cat /etc/slurm/slurm.conf")
root@server1:/etc/slurm# diff gres.conf <(ssh server2 "cat /etc/slurm/gres.conf")
root@server1:/etc/slurm# diff gres.conf <(ssh server3 "cat /etc/slurm/gres.conf")
root@server1:/etc/slurm#
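One last check I intend to try is confirming that the NodeName line in slurm.conf matches what slurmd itself detects on each machine, and that the device file referenced in gres.conf actually exists everywhere; again, just a sketch of what I have in mind:
slurmd -C                                     # prints the node configuration slurmd detects on this host
ls -l /dev/nvidia0
ssh server2 "slurmd -C; ls -l /dev/nvidia0"
ssh server3 "slurmd -C; ls -l /dev/nvidia0"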
Thank you,
Shookti