Regarding performance, have a look at the release notes:

https://slurm.schedmd.com/release_notes.html

Maybe you are being bitten by this change?


Docs at:

https://slurm.schedmd.com/slurm.conf.html#OPT_conmgr_threads

The sdiag command might help you debug this issue or performance in general.
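
If it helps, here is a minimal sketch for pulling single counters out of sdiag output so they can be logged over time. The helper name sdiag_field is made up for this example, and the field labels ("Server thread count", "Agent queue size") are assumed to match what your sdiag version prints; please check them against your own output.

```shell
#!/bin/sh
# sdiag_field <label>: read sdiag output on stdin and print the value
# of the first line whose label matches exactly.
# Hypothetical helper; assumes sdiag prints "Label:   value" lines.
sdiag_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}
```

For example, "sdiag | sdiag_field 'Server thread count'" run every minute from cron or a loop would show whether slurmctld's thread count and agent queue climb before a freeze.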

Kind Regards

--
Mick Timony
Senior DevOps Engineer
LASER, Longwood, & O2 Cluster Admin
Harvard Medical School
--

From: Grigory Shamov via slurm-users <slurm-users@lists.schedmd.com>
Sent: Sunday, September 28, 2025 9:52 AM
To: Brian Andrus <toomuchit@gmail.com>; slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: How to make TLS and PMIx v4 work together?
 

Hi Brian,

 

Thank you very much! We will try it.

 

Another thing we have noticed is a massive decrease in slurmctld performance. We had to quadruple the VM’s memory and CPU cores compared to 24.11 so that 25.05 would run without freezing.

Does everyone see this, or did we misconfigure some settings of the new RPC connection manager?

 

-- 

Grigory Shamov

Site Lead / HPC Specialist

University of Manitoba and DRI Alliance Canada

 

 

From: Brian Andrus via slurm-users <slurm-users@lists.schedmd.com>
Reply-To: Brian Andrus <toomuchit@gmail.com>
Date: Thursday, September 25, 2025 at 6:02 PM
To: "slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: How to make TLS and PMIx v4 work together?

 

Caution! This message was sent from outside the University of Manitoba.

 

Grigory, 

You likely need to add your CA to the nodes and update. Under Ubuntu, you would:

  • Put your CA certificate file (PEM format, with a .crt extension) in /usr/local/share/ca-certificates/
  • Run /usr/sbin/update-ca-certificates

This should create a .pem entry in /etc/ssl/certs for that CA, after which certificates signed by it are trusted.

You will need to do that on all your systems that need to trust your CA.
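
The steps above are for Ubuntu. Since the cluster in question runs Alma 8.10, here is a rough sketch covering both distro families. The function names are made up for this example, and it assumes the system trust store is really what the failing component consults, which this thread has not confirmed. On RHEL-family systems the anchor directory is /etc/pki/ca-trust/source/anchors and the refresh command is update-ca-trust.

```shell
#!/bin/sh
# Hypothetical helpers: map the ID field of /etc/os-release to the
# trust-anchor directory and the command that rebuilds the trust store.

trust_anchor_dir() {
    case "$1" in
        ubuntu|debian) echo /usr/local/share/ca-certificates ;;
        almalinux|rocky|rhel|centos|fedora) echo /etc/pki/ca-trust/source/anchors ;;
        *) return 1 ;;
    esac
}

trust_update_cmd() {
    case "$1" in
        ubuntu|debian) echo "update-ca-certificates" ;;
        almalinux|rocky|rhel|centos|fedora) echo "update-ca-trust extract" ;;
        *) return 1 ;;
    esac
}

# install_ca <distro-id> <ca-cert.pem>: copy the CA certificate into the
# anchor directory and refresh the store. Must be run as root.
install_ca() {
    dir=$(trust_anchor_dir "$1") || { echo "unsupported distro: $1" >&2; return 1; }
    cmd=$(trust_update_cmd "$1") || return 1
    name=$(basename "$2")
    # update-ca-certificates only picks up files ending in .crt
    install -m 0644 "$2" "$dir/${name%.*}.crt" && $cmd
}
```

On each node that needs to trust the CA, something like ". /etc/os-release && install_ca "$ID" /path/to/your-ca.pem" would do it, where the path is a placeholder for wherever your CA certificate lives.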

Brian Andrus

 

On 9/25/2025 11:11 AM, Grigory Shamov via slurm-users wrote:

Forgot to add: s2n-tls comes from EPEL and is version 1.5.10.
 
 
 
On 2025-09-25, 11:56 AM, "Grigory Shamov via slurm-users" <slurm-users@lists.schedmd.com> wrote:
 
 
Hi All,
 
 
We have updated Slurm to the current 25.05.x and tried to enable TLS on it. The OS is Alma 8.10, with cgroups v1 and PMIx v4.
 
 
We see that srun fails for MPI jobs across nodes with TLS-related errors when using PMIx (the default), but succeeds with srun --mpi=pmi2 or with mpirun.
 
 
TLSType = tls/s2n
TLSParameters = ca_cert_file= (has all the certs here under /etc/slurm/certs)
 
 
And the errors when using PMIx are
 
 
[2025-09-25T11:04:43.894] error: con_close_on_poll_error: [n388:6818(fd:15)] socket error encountered while polling: Connection reset by peer
[2025-09-25T11:04:50.102] [6451416.0] error: _negotiate: s2n_negotiate() failed S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/tls/s2n_x509_validator.c:494
(couple of these)
[2025-09-25T11:05:57.878] [6451416.0] error: tls_p_recv: s2n_recv() failed S2N_ERR_CLOSED[134217728]: connection is closed -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:37
[2025-09-25T11:05:57.883] [6451416.0] error: tls_p_send: s2n_send() failed S2N_ERR_IO[67108864]: underlying I/O operation failed, check system errno -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:28
(couple of these)
[2025-09-25T11:05:59.076] error: wrap_on_data: [unix:/var/spool/slurmd/slurmd.socket(fd:17)] on_data returned rc: Unable to proxy slurmstepd message
[2025-09-25T11:05:59.076] [6451416.0] error: _stepd_send_recv_msg: slurmd was unable to proxy request message to its final destination
[2025-09-25T11:05:59.878] error: _slurmd_send_recv_msg: Failed to send/recv slurmstepd message MESSAGE_TASK_EXIT using proxy_type PROXY_TO_NODE_SEND_RECV
 
 
[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: pmixp_p2p_send: n388 [0]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: _slurm_send: n388 [0]: pmixp_server.c:1586: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.6451416.0, size = 27679, hostlist: (null)
(and a couple more PMIx errors). It looks like the PMIx servers can no longer talk to their peers?
 
 
We did not configure the certgen plugin specifically, because the Slurm documentation seems to say it is optional(?).
 
 
I wonder what we are missing to get Slurm 25.05 running with TLS enabled and PMIx working. Any advice appreciated! Thanks!
 
 
--
Grigory Shamov
Site Lead / HPC Specialist
University of Manitoba and DRI Alliance Canada
 
 
 
 
 
 