Dear community,
I am having a strange issue whose cause I have not been able to find. Last
week I did a full update of the cluster, which is composed of a master node
and two compute nodes (nodeGPU01 -> DGX A100, nodeGPU02 -> custom GPU
server). After the update:
- the master node ended up on Ubuntu 24.04,
- nodeGPU01 on the latest DGX OS (still Ubuntu 22.04),
- nodeGPU02 on Ubuntu 24.04 LTS.
Since then:
- Launching jobs from the master on the partitions of nodeGPU01 works
perfectly.
- Launching jobs from the master on the partition of nodeGPU02 stopped
working (the job hangs).
In short, nodeGPU02 (Ubuntu 24.04) no longer runs jobs successfully, while
nodeGPU01 keeps working perfectly even with the master on Ubuntu 24.04.
Any help is welcome; I have tried many things without success in finding
the cause. Please let me know if you need more information. Many thanks in
advance.
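In case the exact versions matter (my guess is that the update pulled
different Slurm releases from the Ubuntu 22.04 and 24.04 repositories), this
is roughly how I would collect the Slurm component versions on the three
machines; I can post the output if useful:

# Installed Slurm packages on this machine (Debian/Ubuntu)
dpkg -l | grep -i slurm

# Versions of the local daemons and client commands
slurmctld -V      # on the master only
slurmd -V         # on nodeGPU01 / nodeGPU02
srun --version

# Version each node reported to the controller at registration
scontrol show node nodeGPU01 | grep -i version
scontrol show node nodeGPU02 | grep -i version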
This is the initial `slurmd` log of the problematic node (nodeGPU02); note
the `_step_connect ... Connection refused` messages (highlighted in yellow
in my terminal):
➜ ~ sudo systemctl status slurmd.service
● slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; preset: enabled)
     Active: active (running) since Sat 2024-09-28 14:00:22 -03; 4s ago
   Main PID: 4821 (slurmd)
      Tasks: 1
     Memory: 17.0M (peak: 29.7M)
        CPU: 174ms
     CGroup: /system.slice/slurmd.service
             └─4821 /usr/sbin/slurmd -D -s
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: MPI: Loading all types
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: No mpi.conf file (/etc/slurm/mpi.conf)
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: slurmd started on Sat, 28 Sep 2024 14:00:25 -0300
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: health_check success rc:0 output:
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: CPUs=128 Boards=1 Sockets=2 Cores=64 Threads=1 Memory=773744 TmpDisk=899181 Uptime=2829 CPUSpecList=(null) FeaturesAvail=(nu>
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _handle_node_reg_resp: slurmctld sent back 11 TRES
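Those two `Connection refused` lines refer to a leftover step socket
(`nodeGPU02_57436.0`) under the slurmd spool directory. I am not sure this
is related, but this is how I would confirm the configured spool path and
look for stale sockets from before the update (assuming the spool directory
really is `/var/spool/slurmd/slurmd` as the log suggests):

# Spool directory slurmd is configured to use
scontrol show config | grep -i SlurmdSpoolDir

# Any stale step sockets left over from before the update/reboot
ls -l /var/spool/slurmd/slurmd/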
This is the verbose output of the `srun` command; note the `Socket timed
out on send/recv operation` errors (highlighted in yellow in my terminal):
➜ ~ srun -vvvp rtx hostname
srun: defined options
srun: -------------------- --------------------
srun: partition : rtx
srun: verbose : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=3090276
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0002
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 34081
srun: debug: Entering _msg_thr_internal
srun: Waiting for resource configuration
srun: Nodes nodeGPU02 are ready for job
srun: jobid 57463: nodes(1):`nodeGPU02', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:1 is not a gres:
srun: debug: requesting job 57463, user 99, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name hostname, relative 65534
srun: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v4: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 41393
srun: debug: mpi/pmix_v4: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:285: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
srun: debug: initialized stdio listening socket, port 33223
srun: debug: Started IO server thread (140079189182144)
srun: debug: Entering _launch_tasks
srun: launching StepId=57463.0 on host nodeGPU02, 1 tasks: 0
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: route/default: init: route default plugin loaded
srun: debug2: Called _file_writable
srun: topology/none: init: topology NONE plugin loaded
srun: debug2: Tree head got back 0 looking for 1
srun: debug: slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msgs: [[nodeGPU02]:6818] failed: Socket timed out on send/recv operation
srun: debug2: Tree head got back 1
srun: debug: launch returned msg_rc=1001 err=5004 type=9001
srun: debug2: marking task 0 done on failed node 0
srun: error: Task launch for StepId=57463.0 failed on node nodeGPU02: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted
srun: debug2: false, shutdown
srun: debug2: false, shutdown
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: false, shutdown
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v4: _conn_readable: (null) [0]: pmixp_agent.c:105: false, shutdown
srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:355: Abort thread exit
srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
srun: debug2: false, shutdown
srun: debug: Leaving _msg_thr_internal
srun: debug2: spank: spank_pyxis.so: exit = 0
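The timeout above is against nodeGPU02 port 6818, which I understand is the
slurmd port (the default SlurmdPort). If it helps narrow things down, I can
verify basic reachability of that port from the master and recheck what the
controller thinks of the node, along these lines (assuming netcat is
installed; this is only a sketch, not output I already have):

# From the master: is slurmd's port on nodeGPU02 reachable at all?
nc -zv nodeGPU02 6818

# Current node state and any drain/down reason according to slurmctld
scontrol show node nodeGPU02 | grep -i -E 'State|Reason'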
This is the `tail -f` log of slurmctld when launching a simple `srun hostname`:
[2024-09-28T14:08:10.264] ====================
[2024-09-28T14:08:10.264] JobId=57463 nhosts:1 ncpus:1 node_req:1 nodes=nodeGPU02
[2024-09-28T14:08:10.264] Node[0]:
[2024-09-28T14:08:10.264] Mem(MB):65536:0 Sockets:2 Cores:64 CPUs:1:0
[2024-09-28T14:08:10.264] Socket[0] Core[0] is allocated
[2024-09-28T14:08:10.264] --------------------
[2024-09-28T14:08:10.264] cpu_array_value[0]:1 reps:1
[2024-09-28T14:08:10.264] ====================
[2024-09-28T14:08:10.264] gres/gpu: state for nodeGPU02
[2024-09-28T14:08:10.264] gres_cnt found:3 configured:3 avail:3 alloc:0
[2024-09-28T14:08:10.264] gres_bit_alloc: of 3
[2024-09-28T14:08:10.264] gres_used:(null)
[2024-09-28T14:08:10.264] topo[0]:(null)(0)
[2024-09-28T14:08:10.264] topo_core_bitmap[0]:0-63 of 128
[2024-09-28T14:08:10.264] topo_gres_bitmap[0]:0 of 3
[2024-09-28T14:08:10.264] topo_gres_cnt_alloc[0]:0
[2024-09-28T14:08:10.264] topo_gres_cnt_avail[0]:1
[2024-09-28T14:08:10.264] topo[1]:(null)(0)
[2024-09-28T14:08:10.264] topo_core_bitmap[1]:0-63 of 128
[2024-09-28T14:08:10.264] topo_gres_bitmap[1]:1 of 3
[2024-09-28T14:08:10.264] topo_gres_cnt_alloc[1]:0
[2024-09-28T14:08:10.264] topo_gres_cnt_avail[1]:1
[2024-09-28T14:08:10.264] topo[2]:(null)(0)
[2024-09-28T14:08:10.264] topo_core_bitmap[2]:0-63 of 128
[2024-09-28T14:08:10.264] topo_gres_bitmap[2]:2 of 3
[2024-09-28T14:08:10.264] topo_gres_cnt_alloc[2]:0
[2024-09-28T14:08:10.264] topo_gres_cnt_avail[2]:1
[2024-09-28T14:08:10.265] sched: _slurm_rpc_allocate_resources JobId=57463 NodeList=nodeGPU02 usec=1339
[2024-09-28T14:08:10.368] ====================
[2024-09-28T14:08:10.368] JobId=57463 StepId=0
[2024-09-28T14:08:10.368] JobNode[0] Socket[0] Core[0] is allocated
[2024-09-28T14:08:10.368] ====================
[2024-09-28T14:08:30.409] _job_complete: JobId=57463 WTERMSIG 12
[2024-09-28T14:08:30.410] gres/gpu: state for nodeGPU02
[2024-09-28T14:08:30.410] gres_cnt found:3 configured:3 avail:3 alloc:0
[2024-09-28T14:08:30.410] gres_bit_alloc: of 3
[2024-09-28T14:08:30.410] gres_used:(null)
[2024-09-28T14:08:30.410] topo[0]:(null)(0)
[2024-09-28T14:08:30.410] topo_core_bitmap[0]:0-63 of 128
[2024-09-28T14:08:30.410] topo_gres_bitmap[0]:0 of 3
[2024-09-28T14:08:30.410] topo_gres_cnt_alloc[0]:0
[2024-09-28T14:08:30.410] topo_gres_cnt_avail[0]:1
[2024-09-28T14:08:30.410] topo[1]:(null)(0)
[2024-09-28T14:08:30.410] topo_core_bitmap[1]:0-63 of 128
[2024-09-28T14:08:30.410] topo_gres_bitmap[1]:1 of 3
[2024-09-28T14:08:30.410] topo_gres_cnt_alloc[1]:0
[2024-09-28T14:08:30.410] topo_gres_cnt_avail[1]:1
[2024-09-28T14:08:30.410] topo[2]:(null)(0)
[2024-09-28T14:08:30.410] topo_core_bitmap[2]:0-63 of 128
[2024-09-28T14:08:30.410] topo_gres_bitmap[2]:2 of 3
[2024-09-28T14:08:30.410] topo_gres_cnt_alloc[2]:0
[2024-09-28T14:08:30.410] topo_gres_cnt_avail[2]:1
[2024-09-28T14:08:30.410] _job_complete: JobId=57463 done
[2024-09-28T14:08:58.687] gres/gpu: state for nodeGPU01
[2024-09-28T14:08:58.687] gres_cnt found:8 configured:8 avail:8 alloc:0
[2024-09-28T14:08:58.687] gres_bit_alloc: of 8
[2024-09-28T14:08:58.687] gres_used:(null)
[2024-09-28T14:08:58.687] topo[0]:A100(808464705)
[2024-09-28T14:08:58.687] topo_core_bitmap[0]:48-63 of 128
[2024-09-28T14:08:58.687] topo_gres_bitmap[0]:0 of 8
[2024-09-28T14:08:58.687] topo_gres_cnt_alloc[0]:0
[2024-09-28T14:08:58.687] topo_gres_cnt_avail[0]:1
[2024-09-28T14:08:58.687] topo[1]:A100(808464705)
[2024-09-28T14:08:58.687] topo_core_bitmap[1]:48-63 of 128
[2024-09-28T14:08:58.687] topo_gres_bitmap[1]:1 of 8
[2024-09-28T14:08:58.687] topo_gres_cnt_alloc[1]:0
[2024-09-28T14:08:58.687] topo_gres_cnt_avail[1]:1
[2024-09-28T14:08:58.687] topo[2]:A100(808464705)
[2024-09-28T14:08:58.687] topo_core_bitmap[2]:16-31 of 128
[2024-09-28T14:08:58.687] topo_gres_bitmap[2]:2 of 8
[2024-09-28T14:08:58.687] topo_gres_cnt_alloc[2]:0
[2024-09-28T14:08:58.687] topo_gres_cnt_avail[2]:1
[2024-09-28T14:08:58.687] topo[3]:A100(808464705)
[2024-09-28T14:08:58.687] topo_core_bitmap[3]:16-31 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[3]:3 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[3]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[3]:1
[2024-09-28T14:08:58.688] topo[4]:A100(808464705)
[2024-09-28T14:08:58.688] topo_core_bitmap[4]:112-127 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[4]:4 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[4]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[4]:1
[2024-09-28T14:08:58.688] topo[5]:A100(808464705)
[2024-09-28T14:08:58.688] topo_core_bitmap[5]:112-127 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[5]:5 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[5]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[5]:1
[2024-09-28T14:08:58.688] topo[6]:A100(808464705)
[2024-09-28T14:08:58.688] topo_core_bitmap[6]:80-95 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[6]:6 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[6]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[6]:1
[2024-09-28T14:08:58.688] topo[7]:A100(808464705)
[2024-09-28T14:08:58.688] topo_core_bitmap[7]:80-95 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[7]:7 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[7]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[7]:1
[2024-09-28T14:08:58.688] type[0]:A100(808464705)
[2024-09-28T14:08:58.688] type_cnt_alloc[0]:0
[2024-09-28T14:08:58.688] type_cnt_avail[0]:8
[2024-09-28T14:08:58.690] gres/gpu: state for nodeGPU02
[2024-09-28T14:08:58.690] gres_cnt found:3 configured:3 avail:3 alloc:0
[2024-09-28T14:08:58.690] gres_bit_alloc: of 3
[2024-09-28T14:08:58.690] gres_used:(null)
[2024-09-28T14:08:58.690] topo[0]:(null)(0)
[2024-09-28T14:08:58.690] topo_core_bitmap[0]:0-63 of 128
[2024-09-28T14:08:58.690] topo_gres_bitmap[0]:0 of 3
[2024-09-28T14:08:58.690] topo_gres_cnt_alloc[0]:0
[2024-09-28T14:08:58.690] topo_gres_cnt_avail[0]:1
[2024-09-28T14:08:58.690] topo[1]:(null)(0)
[2024-09-28T14:08:58.690] topo_core_bitmap[1]:0-63 of 128
[2024-09-28T14:08:58.690] topo_gres_bitmap[1]:1 of 3
[2024-09-28T14:08:58.690] topo_gres_cnt_alloc[1]:0
[2024-09-28T14:08:58.690] topo_gres_cnt_avail[1]:1
[2024-09-28T14:08:58.690] topo[2]:(null)(0)
[2024-09-28T14:08:58.690] topo_core_bitmap[2]:0-63 of 128
[2024-09-28T14:08:58.690] topo_gres_bitmap[2]:2 of 3
[2024-09-28T14:08:58.690] topo_gres_cnt_alloc[2]:0
[2024-09-28T14:08:58.690] topo_gres_cnt_avail[2]:1
[2024-09-28T14:09:49.763] Resending TERMINATE_JOB request JobId=57463 Nodelist=nodeGPU02
This is the `tail -f` log of slurmd on nodeGPU02 when launching the job
from the master; note the `error: _send_slurmstepd_init failed` message
(highlighted in yellow in my terminal):
[2024-09-28T14:08:10.270] debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
[2024-09-28T14:08:10.321] debug2: prep/script: _run_subpath_command: prolog success rc:0 output:
[2024-09-28T14:08:10.323] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2024-09-28T14:08:10.377] debug: Checking credential with 720 bytes of sig data
[2024-09-28T14:08:10.377] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
[2024-09-28T14:08:10.377] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2024-09-28T14:08:10.377] launch task StepId=57463.0 request from UID:10082 GID:10088 HOST:10.10.0.1 PORT:36478
[2024-09-28T14:08:10.377] CPU_BIND: JobNode[0] CPU[0] Step alloc
[2024-09-28T14:08:10.377] CPU_BIND: ====================
[2024-09-28T14:08:10.377] CPU_BIND: Memory extracted from credential for StepId=57463.0 job_mem_limit=65536 step_mem_limit=65536
[2024-09-28T14:08:10.377] debug: Waiting for job 57463's prolog to complete
[2024-09-28T14:08:10.377] debug: Finished wait for job 57463's prolog to complete
[2024-09-28T14:08:10.378] error: _send_slurmstepd_init failed
[2024-09-28T14:08:10.384] debug2: debug level read from slurmd is 'debug2'.
[2024-09-28T14:08:10.385] debug2: _read_slurmd_conf_lite: slurmd sent 11 TRES.
[2024-09-28T14:08:10.385] debug2: Received CPU frequency information for 128 CPUs
[2024-09-28T14:08:10.385] select/cons_tres: common_init: select/cons_tres loaded
[2024-09-28T14:08:10.385] debug: switch/none: init: switch NONE plugin loaded
[2024-09-28T14:08:10.385] [57463.0] debug: auth/munge: init: loaded
[2024-09-28T14:08:10.385] [57463.0] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2024-09-28T14:08:10.395] [57463.0] debug: cgroup/v2: init: Cgroup v2 plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug2: Reading acct_gather.conf file /etc/slurm/acct_gather.conf
[2024-09-28T14:08:10.396] [57463.0] debug2: hwloc_topology_init
[2024-09-28T14:08:10.399] [57463.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
[2024-09-28T14:08:10.400] [57463.0] debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: core enforcement enabled
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: TotCfgRealMem:773744M allowed:100%(enforced), swap:0%(enforced), max:100%(773744M) max+swap:0%(773744M) min:30M kmem:100%(773744M permissive) min:30M
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: memory enforcement enabled
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: device enforcement enabled
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2024-09-28T14:08:10.401] [57463.0] debug: jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
[2024-09-28T14:08:10.401] [57463.0] cred/munge: init: Munge credential signature plugin loaded
[2024-09-28T14:08:10.401] [57463.0] debug: job_container/none: init: job_container none plugin loaded
[2024-09-28T14:08:10.401] [57463.0] debug: gres/gpu: init: loaded
[2024-09-28T14:08:10.401] [57463.0] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2024-09-28T14:08:30.415] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-09-28T14:08:30.415] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-09-28T14:08:30.415] debug: _rpc_terminate_job: uid = 777 JobId=57463
[2024-09-28T14:08:30.415] debug: credential for job 57463 revoked
[2024-09-28T14:08:30.415] debug: sent SUCCESS, waiting for step to start
[2024-09-28T14:08:30.415] debug: Blocked waiting for JobId=57463, all steps
[2024-09-28T14:08:58.688] debug2: Start processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2024-09-28T14:08:58.689] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2024-09-28T14:08:58.689] debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
[2024-09-28T14:08:58.692] debug: _handle_node_reg_resp: slurmctld sent back 11 TRES.
[2024-09-28T14:08:58.692] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
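The line that worries me most is `error: _send_slurmstepd_init failed`,
i.e. slurmd apparently cannot hand the step over to slurmstepd. If it
helps, I can run slurmd in the foreground on nodeGPU02 at higher verbosity
while reproducing the failure, and also confirm that slurmd and slurmstepd
come from the same package, then post that output. Roughly (binary paths
assume the stock Ubuntu slurmd package):

# On nodeGPU02: run slurmd in the foreground with extra verbosity,
# then launch the failing srun from the master and capture the output
sudo systemctl stop slurmd
sudo slurmd -D -vvv      # Ctrl-C and restart the service afterwards

# Check that slurmd and slurmstepd belong to the same package/version
dpkg -S /usr/sbin/slurmd /usr/sbin/slurmstepd
dpkg -l | grep slurmd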
--
Cristóbal A. Navarro