- slurm-users - lists.schedmd.com

How to make TLS and PMIx v4 work together?
by Grigory Shamov 25 Sep '25

25 Sep '25

Hi All, We have updated SLURM to the current 25.05.x and tried to enable TLS on it. The OS is Alma 8.10, cgroups v1, and PMIx v 4. We see that srun fails for MPI jobs across the nodes, with TLS related errors when using PMIx (the default) but passes with srun --mpi=pmi2 or with mpirun . TLSType = tls/s2n TLSParameters = ca_cert_file= (has all the certs here under /etc/slurm/certs) And the errors when using PMIx are 025-09-25T11:04:43.894] error: con_close_on_poll_error: [n388:6818(fd:15)] socket error encountered while polling: Connection reset by peer [2025-09-25T11:04:50.102] [6451416.0] error: _negotiate: s2n_negotiate() failed S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/tls/s2n_x509_validator.c:494 (couple of these) [2025-09-25T11:05:57.878] [6451416.0] error: tls_p_recv: s2n_recv() failed S2N_ERR_CLOSED[134217728]: connection is closed -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:37 [2025-09-25T11:05:57.883] [6451416.0] error: tls_p_send: s2n_send() failed S2N_ERR_IO[67108864]: underlying I/O operation failed, check system errno -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:28 (couple of these) [2025-09-25T11:05:59.076] error: wrap_on_data: [unix:/var/spool/slurmd/slurmd.socket(fd:17)] on_data returned rc: Unable to proxy slurmstepd message [2025-09-25T11:05:59.076] [6451416.0] error: _stepd_send_recv_msg: slurmd was unable to proxy request message to its final destination [2025-09-25T11:05:59.878] error: _slurmd_send_recv_msg: Failed to send/recv slurmstepd message MESSAGE_TASK_EXIT using proxy_type PROXY_TO_NODE_SEND_RECV 2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: pmixp_p2p_send: n388 [0]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit [2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: _slurm_send: n388 [0]: pmixp_server.c:1586: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.6451416.0, size = 27679, hostlist: (null) (and couple more PMIx errors). Looks like PMIx cannot talk to their peers now ? There was no specific configuration for the certgen plugin, because SLURM documentation seems to say it is optional(?). I wonder what do we miss here to have SLURM 25.05 in with TLS enabled and PMIx working? Any advice appreciated! Thanks! -- Grigory Shamov Site Lead / HPC Specialist University of Manitoba and DRI Alliance Canada

1 0

Node switching randomly to down state
by Julien Tailleur 24 Sep '25

24 Sep '25

Dear all, I am maintaining a small computing cluster and I have a weird behavior that I fail at debugging. My cluster comprise one master node and 16 computing servers, organized in two queues, each queue having 8 servers. All servers run up-to-date Debian bullseye. All but 3 servers work flawlessly. From the master node, I can see that 3 servers on one of the queue appear down: jtailleu@kandinsky:~$ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST Volume* up infinite 8 alloc FX[21-24,31-34] Speed up infinite 3 down* FX[12-14] Speed up infinite 4 alloc FX[41-44] Speed up infinite 1 idle FX11 These servers are reachable by SSH/ping jtailleu@kandinsky:~$ ping -c 1 FX12 PING FX12 (192.168.6.22) 56(84) bytes of data. 64 bytes from FX12 (192.168.6.22): icmp_seq=1 ttl=64 time=0.070 ms --- FX12 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.070/0.070/0.070/0.000 ms ##### I can also put these nodes back into idle mode: root@kandinsky:~# scontrol update nodename=FX[12-14] state=idle root@kandinsky:~# sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST Volume* up infinite 8 alloc FX[21-24,31-34] Speed up infinite 3 idle* FX[12-14] Speed up infinite 4 alloc FX[41-44] Speed up infinite 1 idle FX11 But then, they switch back into down mode few minutes later: root@kandinsky:~# sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST Volume* up infinite 8 alloc FX[21-24,31-34] Speed up infinite 3 down* FX[12-14] Speed up infinite 4 alloc FX[41-44] Speed up infinite 1 idle FX11 root@kandinsky:~# sinfo -R REASON USER TIMESTAMP NODELIST Not responding slurm 2025-09-08T15:04:39 FX[12-14] I do not understand where the "not responding" comes from, nor how I can investigate that. Any idea what could trigger this behavior? Best wishes, Julien

4 4

No output and can't job by id
by Dhumal, Dr. Nilesh 22 Sep '25

22 Sep '25

Hello, Recently, we updated Slurm. We are running slrumcld on head node. The slurmdbd is disabled on the head node. The user submits the job and gets the job ID. The squeue is empty, and difficult to track the job progress. The job is not writing any output to a file. How can I resolve this issue? Nilesh Dhumal Associate Professor of Chemistry, http://faculty.fgcu.edu/ndhumal/ Coordinator, FGCU Computational Facility, https://www.fgcu.edu/cas/facultyresources/computationalfacility/ SH-430; Department of Chemistry and Physics Florida Gulf Coast University 10501 FGCU Boulevard South Fort Myers, FL 33965-6565 Phone: (239) 745-4394 Email: ndhumal(a)fgcu.edu

2 2

Node in drain state
by Gestió Servidors 22 Sep '25

22 Sep '25

Hello, I have got a node in "drain" state after finishing a job that was running on it. Log in node reports this information: [...] [2025-09-07T11:09:26.980] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 59238 [2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU input mask for node: 0xFFF [2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU final HW mask for node: 0xFFF [2025-09-07T11:09:26.980] Launching batch job 59238 for UID 21310 [2025-09-07T11:09:27.006] cred/munge: init: Munge credential signature plugin loaded [2025-09-07T11:09:27.007] [59238.batch] debug: auth/munge: init: loaded [2025-09-07T11:09:27.009] [59238.batch] debug: Reading cgroup.conf file /soft/slurm-23.11.0/etc/cgroup.conf [2025-09-07T11:09:27.025] [59238.batch] debug: cgroup/v1: init: Cgroup v1 plugin loaded [2025-09-07T11:09:27.025] [59238.batch] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded [2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: core enforcement enabled [2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: device enforcement enabled [2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: Tasks containment cgroup plugin loaded [2025-09-07T11:09:27.026] [59238.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xfff [2025-09-07T11:09:27.027] [59238.batch] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded [2025-09-07T11:09:27.027] [59238.batch] topology/default: init: topology Default plugin loaded [2025-09-07T11:09:27.030] [59238.batch] debug: gpu/generic: init: init: GPU Generic plugin loaded [2025-09-07T11:09:27.031] [59238.batch] debug: laying out the 12 tasks on 1 hosts clus09 dist 2 [2025-09-07T11:09:27.031] [59238.batch] debug: close_slurmd_conn: sending 0: No error [2025-09-07T11:09:27.031] [59238.batch] debug: Message thread started pid = 910040 [2025-09-07T11:09:27.031] [59238.batch] debug: Setting slurmstepd(910040) oom_score_adj to -1000 [2025-09-07T11:09:27.031] [59238.batch] debug: spank: opening plugin stack /soft/slurm-23.11.0/etc/plugstack.conf [2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-11' [2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-11' [2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-11' [2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-11' [2025-09-07T11:09:27.090] [59238.batch] debug levels are stderr='error', logfile='debug', syslog='fatal' [2025-09-07T11:09:27.090] [59238.batch] starting 1 tasks [2025-09-07T11:09:27.090] [59238.batch] task 0 (910044) started 2025-09-07T11:09:27 [2025-09-07T11:09:27.098] [59238.batch] debug: task/affinity: task_p_pre_launch: affinity StepId=59238.batch, task:0 bind:mask_cpu [2025-09-07T11:09:27.098] [59238.batch] _set_limit: RLIMIT_NPROC : reducing req:255366 to max:159631 [2025-09-07T11:09:27.398] [59238.batch] task 0 (910044) exited with exit code 2. [2025-09-07T11:09:27.399] [59238.batch] debug: task/affinity: task_p_post_term: affinity StepId=59238.batch, task 0 [2025-09-07T11:09:27.399] [59238.batch] debug: signaling condition [2025-09-07T11:09:27.399] [59238.batch] debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded [2025-09-07T11:09:27.400] [59238.batch] debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded [2025-09-07T11:09:27.400] [59238.batch] job 59238 completed with slurm_rc = 0, job_rc = 512 [2025-09-07T11:09:27.410] [59238.batch] debug: Message thread exited [2025-09-07T11:09:27.410] [59238.batch] stepd_cleanup: done with step (rc[0x200]:Unknown error 512, cleanup_rc[0x0]:No error) [2025-09-07T11:09:27.411] debug: _rpc_terminate_job: uid = 1000 JobId=59238 [2025-09-07T11:09:27.411] debug: credential for job 59238 revoked [...] "sinfo" shows: [root@login-node ~]# sinfo PARTITION TIMELIMIT AVAIL STATE NODELIST CPU_LOAD NODES(A/I) NODES(A/I/O/T) CPUS CPUS(A/I/O/T) REASON node.q* 4:00:00 up drained clus09 0.00 0/0 0/0/1/1 12 0/0/12/12 Kill task faile node.q* 4:00:00 up allocated clus[10-11] 13.82-15.8 2/0 2/0/0/2 12 24/0/0/24 none node.q* 4:00:00 up idle clus[01-06,12] 0.00 0/7 0/7/0/7 12 0/84/0/84 none But it seems there is no error in node... Slurmctld.log in server seems correct, too. Is there any way to reset node to "state=idle" after errors in the same way? Thanks.

3 4

Compute node not responding
by Dhumal, Dr. Nilesh 21 Sep '25

21 Sep '25

Hello, Recently, we installed slum 25 on our cluster. We are not monitoring the user's account. We didn't configure the sql database on the head node. We are running slurmcld on head node and slumd on the compute node. We are getting the following error Head node: compute node not responding. Compute node: 2025-09-19T15:30:23.461] error: Unable to register: Unable to contact slurm controller (connect failure) Do we need to run slumdbd on the head node? I checked the network connection by pinging the compute node from the head node. Do you have any suggestions to resolve this issue? Thanks Nilesh Get Outlook for Android<https://aka.ms/AAb9ysg>

4 5

Re: Node in drain state
by Gestió Servidors 19 Sep '25

19 Sep '25

Hi, After reading answer from Ole Holm Nielsen, I have increased "MessageTimeout" to 20s (by default is 5s) and "UnkillableStepTimeout" to 150s (by default is 60s and, always 5 times larger than "MessageTimeout"). However, I have also read that UnkillableStepProgram indicates the program to use in that cases... but, by default there is no program assigned to that parameter (no program to run). So my question is if someone uses a customized "UnkillableStepProgram" and if he/she could explain that. Thanks a lot!

3 2

seff for GPU
by Josu Lazkano Lete 18 Sep '25

18 Sep '25

Hello, We are looking to optimize the GPU jobs of our HPC users, is it possible to add GPU info in the seff? It will be great to know how much GPU resources the users request and compare with how much GPU resources they use. Kind regards. [image: Vicomtech] <https://www.vicomtech.org> Josu Lazkano Lete Systems Manager Infrastructures and General Services jlazkano(a)vicomtech.org +(34) 943 30 92 30 The information contained in this electronic message is intended only for the personal and confidential use of the recipients. If you have received this e-mail by mistake, please, notify us and delete it. Avoid printing this message if it is not strictly necessary.

7 9

Scheduling issues with multiple different types of GPU in one partition
by Kevin M. Hildebrand 17 Sep '25

17 Sep '25

We have several different types of GPUs in the same 'gpu' partition. The problem we're having occurs when one of those types of GPUs is fully occupied and there are a bunch of queued jobs waiting for those GPUs. If someone requests idle GPUs of a different type, those jobs end up getting stalled, even though there are plenty of GPUs available. For example, say we have 10 A100 GPUs and 10 H100 GPUs. If there are 10 H100 GPU jobs running and more in queue waiting for them, subsequently submitted A100 jobs will sit in queue even if there are plenty of idle A100 GPUs. The only way we can get the A100 jobs to run is by manually bumping their priority higher than the pending H100 jobs. Has anyone else encountered this issue? The only way we can think of to potentially solve it is to have separate partitions for each GPU type, but that seems unwieldy. We are currently running Slurm 24.05.8. Thanks, Kevin -- Kevin Hildebrand Director of Research Technology and HPC Services Division of IT

7 7

New "NOT-state" selection of the sinfo command in Slurm 25.05
by Ole Holm Nielsen 10 Sep '25

10 Sep '25

We just upgraded Slurm to 25.05.3, and I would like to highlight a new functionality of the "sinfo -t <state>" command in 25.05: > The state can be prefixed with '~' which will invert the result of match. We find this "NOT-state" selection useful together with Slurm power saving [1] where any idle nodes may be powered off to save electrical power or cloud costs. For example, the ClusterShell [2] clush command can now skip connections to the powered down nodes like in this example: $ clush -bw@slurmstate:~POWERED_DOWN uname -r You can even configure ClusterShell so that "clush -a" includes all nodes which are *not* powered down. The details are described in the Wiki page [3]. Best regards, Ole [1] https://slurm.schedmd.com/power_save.html [2] https://clustershell.readthedocs.io/en/latest/intro.html [3] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#skipping-powere… -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

1 0

Development RPMs for cgroups v2
by John Hearns 10 Sep '25

10 Sep '25

I am building a version of Slurm on RHEL 9.4 When I run an rpmbuild, he slurm rpm does not contain /usr/lib64/slurm/cgroup_v2.so I have tried to look in the build logs. I suspect I am lacking some development RPMs - but which ones? All hints gratefully received John H

4 5

2026

2025

2024

slurm-users