Dear community, I am having a strange issue whose cause I have been unable to find. Last week I did a full update on the cluster, which is composed of the master node and two compute nodes (nodeGPU01 -> DGX A100 and nodeGPU02 -> custom GPU server). After the update:
- the master node ended up with Ubuntu 24.04,
- nodeGPU01 with the latest DGX OS (still Ubuntu 22.04),
- nodeGPU02 with Ubuntu 24.04 LTS.
- Launching jobs from the master on the partitions of nodeGPU01 works perfectly.
- Launching jobs from the master on the partition of nodeGPU02 stopped working (it hangs).
nodeGPU02 (Ubuntu 24.04) no longer processes jobs successfully, while nodeGPU01 keeps working perfectly even though the master is now on Ubuntu 24.04. Any help is welcome; I have tried many things and had no success in finding the cause. Please let me know if you need more information. Many thanks in advance.
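In case a package mismatch between the two distributions is relevant, these are the commands I would run to compare the Slurm and MUNGE versions on the master and on each compute node (just a sketch; the hostnames are from my setup). I can post the output if that helps:

```bash
# On the master (Ubuntu 24.04): Slurm client/controller version
sinfo --version
scontrol --version

# On each compute node: slurmd and munge versions
ssh nodeGPU01 'slurmd -V; munge --version'
ssh nodeGPU02 'slurmd -V; munge --version'
```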
This is the initial `slurmd` log of the problematic node (nodeGPU02); note the `_step_connect: connect() failed ... Connection refused` messages (highlighted in yellow in my terminal):
➜ ~ sudo systemctl status slurmd.service
● slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; preset: enabled)
     Active: active (running) since Sat 2024-09-28 14:00:22 -03; 4s ago
   Main PID: 4821 (slurmd)
      Tasks: 1
     Memory: 17.0M (peak: 29.7M)
        CPU: 174ms
     CGroup: /system.slice/slurmd.service
             └─4821 /usr/sbin/slurmd -D -s

Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: MPI: Loading all types
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: No mpi.conf file (/etc/slurm/mpi.conf)
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: slurmd started on Sat, 28 Sep 2024 14:00:25 -0300
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: health_check success rc:0 output:
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: CPUs=128 Boards=1 Sockets=2 Cores=64 Threads=1 Memory=773744 TmpDisk=899181 Uptime=2829 CPUSpecList=(null) FeaturesAvail=(nu>
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _handle_node_reg_resp: slurmctld sent back 11 TRES
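If it is useful, I can also post the output of these checks on nodeGPU02 regarding the spool directory referenced in the `connect() failed` messages above (the path below is copied from the log, so it may simply reflect my configured `SlurmdSpoolDir`):

```bash
# On nodeGPU02: confirm the configured spool directory and its permissions
scontrol show config | grep -i SlurmdSpoolDir
ls -ld /var/spool/slurmd /var/spool/slurmd/slurmd
```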
This is the verbose output of the `srun` command; note the socket timeout errors near the end (highlighted in yellow in my terminal).

➜ ~ srun -vvvp rtx hostname
srun: defined options
srun: -------------------- --------------------
srun: partition : rtx
srun: verbose : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=3090276
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0002
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 34081
srun: debug: Entering _msg_thr_internal
srun: Waiting for resource configuration
srun: Nodes nodeGPU02 are ready for job
srun: jobid 57463: nodes(1):`nodeGPU02', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:1 is not a gres:
srun: debug: requesting job 57463, user 99, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name hostname, relative 65534
srun: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v4: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 41393
srun: debug: mpi/pmix_v4: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:285: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
srun: debug: initialized stdio listening socket, port 33223
srun: debug: Started IO server thread (140079189182144)
srun: debug: Entering _launch_tasks
srun: launching StepId=57463.0 on host nodeGPU02, 1 tasks: 0
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: route/default: init: route default plugin loaded
srun: debug2: Called _file_writable
srun: topology/none: init: topology NONE plugin loaded
srun: debug2: Tree head got back 0 looking for 1
srun: debug: slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msgs: [[nodeGPU02]:6818] failed: Socket timed out on send/recv operation
srun: debug2: Tree head got back 1
srun: debug: launch returned msg_rc=1001 err=5004 type=9001
srun: debug2: marking task 0 done on failed node 0
srun: error: Task launch for StepId=57463.0 failed on node nodeGPU02: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted
srun: debug2: false, shutdown
srun: debug2: false, shutdown
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: false, shutdown
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v4: _conn_readable: (null) [0]: pmixp_agent.c:105: false, shutdown
srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:355: Abort thread exit
srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
srun: debug2: false, shutdown
srun: debug: Leaving _msg_thr_internal
srun: debug2: spank: spank_pyxis.so: exit = 0
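Since the error above is a socket timeout talking to `[[nodeGPU02]:6818]`, I can also run these basic connectivity and authentication checks between the master and nodeGPU02 and post the output (a sketch; 6818 is the port taken from the timeout message, i.e. the default SlurmdPort):

```bash
# From the master: is slurmd on nodeGPU02 reachable on its port?
nc -zv nodeGPU02 6818

# MUNGE credential check in both directions
munge -n | ssh nodeGPU02 unmunge
ssh nodeGPU02 'munge -n' | unmunge
```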
This is the `tail -f` log of slurmctld when launching a simple `srun hostname`:

[2024-09-28T14:08:10.264] ====================
[2024-09-28T14:08:10.264] JobId=57463 nhosts:1 ncpus:1 node_req:1 nodes=nodeGPU02
[2024-09-28T14:08:10.264] Node[0]:
[2024-09-28T14:08:10.264] Mem(MB):65536:0 Sockets:2 Cores:64 CPUs:1:0
[2024-09-28T14:08:10.264] Socket[0] Core[0] is allocated
[2024-09-28T14:08:10.264] --------------------
[2024-09-28T14:08:10.264] cpu_array_value[0]:1 reps:1
[2024-09-28T14:08:10.264] ====================
[2024-09-28T14:08:10.264] gres/gpu: state for nodeGPU02
[2024-09-28T14:08:10.264] gres_cnt found:3 configured:3 avail:3 alloc:0
[2024-09-28T14:08:10.264] gres_bit_alloc: of 3
[2024-09-28T14:08:10.264] gres_used:(null)
[2024-09-28T14:08:10.264] topo[0]:(null)(0)
[2024-09-28T14:08:10.264] topo_core_bitmap[0]:0-63 of 128
[2024-09-28T14:08:10.264] topo_gres_bitmap[0]:0 of 3
[2024-09-28T14:08:10.264] topo_gres_cnt_alloc[0]:0
[2024-09-28T14:08:10.264] topo_gres_cnt_avail[0]:1
[2024-09-28T14:08:10.264] topo[1]:(null)(0)
[2024-09-28T14:08:10.264] topo_core_bitmap[1]:0-63 of 128
[2024-09-28T14:08:10.264] topo_gres_bitmap[1]:1 of 3
[2024-09-28T14:08:10.264] topo_gres_cnt_alloc[1]:0
[2024-09-28T14:08:10.264] topo_gres_cnt_avail[1]:1
[2024-09-28T14:08:10.264] topo[2]:(null)(0)
[2024-09-28T14:08:10.264] topo_core_bitmap[2]:0-63 of 128
[2024-09-28T14:08:10.264] topo_gres_bitmap[2]:2 of 3
[2024-09-28T14:08:10.264] topo_gres_cnt_alloc[2]:0
[2024-09-28T14:08:10.264] topo_gres_cnt_avail[2]:1
[2024-09-28T14:08:10.265] sched: _slurm_rpc_allocate_resources JobId=57463 NodeList=nodeGPU02 usec=1339
[2024-09-28T14:08:10.368] ====================
[2024-09-28T14:08:10.368] JobId=57463 StepId=0
[2024-09-28T14:08:10.368] JobNode[0] Socket[0] Core[0] is allocated
[2024-09-28T14:08:10.368] ====================
[2024-09-28T14:08:30.409] _job_complete: JobId=57463 WTERMSIG 12
[2024-09-28T14:08:30.410] gres/gpu: state for nodeGPU02
[2024-09-28T14:08:30.410] gres_cnt found:3 configured:3 avail:3 alloc:0
[2024-09-28T14:08:30.410] gres_bit_alloc: of 3
[2024-09-28T14:08:30.410] gres_used:(null)
[2024-09-28T14:08:30.410] topo[0]:(null)(0)
[2024-09-28T14:08:30.410] topo_core_bitmap[0]:0-63 of 128
[2024-09-28T14:08:30.410] topo_gres_bitmap[0]:0 of 3
[2024-09-28T14:08:30.410] topo_gres_cnt_alloc[0]:0
[2024-09-28T14:08:30.410] topo_gres_cnt_avail[0]:1
[2024-09-28T14:08:30.410] topo[1]:(null)(0)
[2024-09-28T14:08:30.410] topo_core_bitmap[1]:0-63 of 128
[2024-09-28T14:08:30.410] topo_gres_bitmap[1]:1 of 3
[2024-09-28T14:08:30.410] topo_gres_cnt_alloc[1]:0
[2024-09-28T14:08:30.410] topo_gres_cnt_avail[1]:1
[2024-09-28T14:08:30.410] topo[2]:(null)(0)
[2024-09-28T14:08:30.410] topo_core_bitmap[2]:0-63 of 128
[2024-09-28T14:08:30.410] topo_gres_bitmap[2]:2 of 3
[2024-09-28T14:08:30.410] topo_gres_cnt_alloc[2]:0
[2024-09-28T14:08:30.410] topo_gres_cnt_avail[2]:1
[2024-09-28T14:08:30.410] _job_complete: JobId=57463 done
[2024-09-28T14:08:58.687] gres/gpu: state for nodeGPU01
[2024-09-28T14:08:58.687] gres_cnt found:8 configured:8 avail:8 alloc:0
[2024-09-28T14:08:58.687] gres_bit_alloc: of 8
[2024-09-28T14:08:58.687] gres_used:(null)
[2024-09-28T14:08:58.687] topo[0]:A100(808464705)
[2024-09-28T14:08:58.687] topo_core_bitmap[0]:48-63 of 128
[2024-09-28T14:08:58.687] topo_gres_bitmap[0]:0 of 8
[2024-09-28T14:08:58.687] topo_gres_cnt_alloc[0]:0
[2024-09-28T14:08:58.687] topo_gres_cnt_avail[0]:1
[2024-09-28T14:08:58.687] topo[1]:A100(808464705)
[2024-09-28T14:08:58.687] topo_core_bitmap[1]:48-63 of 128
[2024-09-28T14:08:58.687] topo_gres_bitmap[1]:1 of 8
[2024-09-28T14:08:58.687] topo_gres_cnt_alloc[1]:0
[2024-09-28T14:08:58.687] topo_gres_cnt_avail[1]:1
[2024-09-28T14:08:58.687] topo[2]:A100(808464705)
[2024-09-28T14:08:58.687] topo_core_bitmap[2]:16-31 of 128
[2024-09-28T14:08:58.687] topo_gres_bitmap[2]:2 of 8
[2024-09-28T14:08:58.687] topo_gres_cnt_alloc[2]:0
[2024-09-28T14:08:58.687] topo_gres_cnt_avail[2]:1
[2024-09-28T14:08:58.687] topo[3]:A100(808464705)
[2024-09-28T14:08:58.687] topo_core_bitmap[3]:16-31 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[3]:3 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[3]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[3]:1
[2024-09-28T14:08:58.688] topo[4]:A100(808464705)
[2024-09-28T14:08:58.688] topo_core_bitmap[4]:112-127 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[4]:4 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[4]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[4]:1
[2024-09-28T14:08:58.688] topo[5]:A100(808464705)
[2024-09-28T14:08:58.688] topo_core_bitmap[5]:112-127 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[5]:5 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[5]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[5]:1
[2024-09-28T14:08:58.688] topo[6]:A100(808464705)
[2024-09-28T14:08:58.688] topo_core_bitmap[6]:80-95 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[6]:6 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[6]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[6]:1
[2024-09-28T14:08:58.688] topo[7]:A100(808464705)
[2024-09-28T14:08:58.688] topo_core_bitmap[7]:80-95 of 128
[2024-09-28T14:08:58.688] topo_gres_bitmap[7]:7 of 8
[2024-09-28T14:08:58.688] topo_gres_cnt_alloc[7]:0
[2024-09-28T14:08:58.688] topo_gres_cnt_avail[7]:1
[2024-09-28T14:08:58.688] type[0]:A100(808464705)
[2024-09-28T14:08:58.688] type_cnt_alloc[0]:0
[2024-09-28T14:08:58.688] type_cnt_avail[0]:8
[2024-09-28T14:08:58.690] gres/gpu: state for nodeGPU02
[2024-09-28T14:08:58.690] gres_cnt found:3 configured:3 avail:3 alloc:0
[2024-09-28T14:08:58.690] gres_bit_alloc: of 3
[2024-09-28T14:08:58.690] gres_used:(null)
[2024-09-28T14:08:58.690] topo[0]:(null)(0)
[2024-09-28T14:08:58.690] topo_core_bitmap[0]:0-63 of 128
[2024-09-28T14:08:58.690] topo_gres_bitmap[0]:0 of 3
[2024-09-28T14:08:58.690] topo_gres_cnt_alloc[0]:0
[2024-09-28T14:08:58.690] topo_gres_cnt_avail[0]:1
[2024-09-28T14:08:58.690] topo[1]:(null)(0)
[2024-09-28T14:08:58.690] topo_core_bitmap[1]:0-63 of 128
[2024-09-28T14:08:58.690] topo_gres_bitmap[1]:1 of 3
[2024-09-28T14:08:58.690] topo_gres_cnt_alloc[1]:0
[2024-09-28T14:08:58.690] topo_gres_cnt_avail[1]:1
[2024-09-28T14:08:58.690] topo[2]:(null)(0)
[2024-09-28T14:08:58.690] topo_core_bitmap[2]:0-63 of 128
[2024-09-28T14:08:58.690] topo_gres_bitmap[2]:2 of 3
[2024-09-28T14:08:58.690] topo_gres_cnt_alloc[2]:0
[2024-09-28T14:08:58.690] topo_gres_cnt_avail[2]:1
[2024-09-28T14:09:49.763] Resending TERMINATE_JOB request JobId=57463 Nodelist=nodeGPU02
This is the `tail -f` log of slurmd on nodeGPU02 when launching the job from the master; note the `error: _send_slurmstepd_init failed` line (highlighted in yellow in my terminal):

[2024-09-28T14:08:10.270] debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
[2024-09-28T14:08:10.321] debug2: prep/script: _run_subpath_command: prolog success rc:0 output:
[2024-09-28T14:08:10.323] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2024-09-28T14:08:10.377] debug: Checking credential with 720 bytes of sig data
[2024-09-28T14:08:10.377] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
[2024-09-28T14:08:10.377] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2024-09-28T14:08:10.377] launch task StepId=57463.0 request from UID:10082 GID:10088 HOST:10.10.0.1 PORT:36478
[2024-09-28T14:08:10.377] CPU_BIND: JobNode[0] CPU[0] Step alloc
[2024-09-28T14:08:10.377] CPU_BIND: ====================
[2024-09-28T14:08:10.377] CPU_BIND: Memory extracted from credential for StepId=57463.0 job_mem_limit=65536 step_mem_limit=65536
[2024-09-28T14:08:10.377] debug: Waiting for job 57463's prolog to complete
[2024-09-28T14:08:10.377] debug: Finished wait for job 57463's prolog to complete
[2024-09-28T14:08:10.378] error: _send_slurmstepd_init failed
[2024-09-28T14:08:10.384] debug2: debug level read from slurmd is 'debug2'.
[2024-09-28T14:08:10.385] debug2: _read_slurmd_conf_lite: slurmd sent 11 TRES.
[2024-09-28T14:08:10.385] debug2: Received CPU frequency information for 128 CPUs
[2024-09-28T14:08:10.385] select/cons_tres: common_init: select/cons_tres loaded
[2024-09-28T14:08:10.385] debug: switch/none: init: switch NONE plugin loaded
[2024-09-28T14:08:10.385] [57463.0] debug: auth/munge: init: loaded
[2024-09-28T14:08:10.385] [57463.0] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2024-09-28T14:08:10.395] [57463.0] debug: cgroup/v2: init: Cgroup v2 plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2024-09-28T14:08:10.396] [57463.0] debug2: Reading acct_gather.conf file /etc/slurm/acct_gather.conf
[2024-09-28T14:08:10.396] [57463.0] debug2: hwloc_topology_init
[2024-09-28T14:08:10.399] [57463.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
[2024-09-28T14:08:10.400] [57463.0] debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: core enforcement enabled
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: TotCfgRealMem:773744M allowed:100%(enforced), swap:0%(enforced), max:100%(773744M) max+swap:0%(773744M) min:30M kmem:100%(773744M permissive) min:30M
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: memory enforcement enabled
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: device enforcement enabled
[2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2024-09-28T14:08:10.401] [57463.0] debug: jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
[2024-09-28T14:08:10.401] [57463.0] cred/munge: init: Munge credential signature plugin loaded
[2024-09-28T14:08:10.401] [57463.0] debug: job_container/none: init: job_container none plugin loaded
[2024-09-28T14:08:10.401] [57463.0] debug: gres/gpu: init: loaded
[2024-09-28T14:08:10.401] [57463.0] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2024-09-28T14:08:30.415] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-09-28T14:08:30.415] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-09-28T14:08:30.415] debug: _rpc_terminate_job: uid = 777 JobId=57463
[2024-09-28T14:08:30.415] debug: credential for job 57463 revoked
[2024-09-28T14:08:30.415] debug: sent SUCCESS, waiting for step to start
[2024-09-28T14:08:30.415] debug: Blocked waiting for JobId=57463, all steps
[2024-09-28T14:08:58.688] debug2: Start processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2024-09-28T14:08:58.689] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2024-09-28T14:08:58.689] debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
[2024-09-28T14:08:58.692] debug: _handle_node_reg_resp: slurmctld sent back 11 TRES.
[2024-09-28T14:08:58.692] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
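Finally, if more context is needed, I can attach slurm.conf, gres.conf, and cgroup.conf for nodeGPU02, as well as the output of the commands below (a sketch of what I would gather; the grep keys are the config options I think are most relevant here, adjust as you see fit):

```bash
# Node state as seen by the controller
scontrol show node nodeGPU02

# Relevant controller settings (version, spool dir, proctrack/task plugins)
scontrol show config | grep -Ei 'version|slurmdspooldir|proctracktype|taskplugin'

# Recent slurmd journal on nodeGPU02
ssh nodeGPU02 'journalctl -u slurmd -n 100 --no-pager'
```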