We are pleased to announce the availability of Slurm release candidate
25.05.0rc1.
To highlight some new features coming in 25.05:
- Support for defining multiple topology configurations, and varying
them by partition.
- Support for tracking and allocating hierarchical resources.
- Dynamic nodes can be dynamically added to the topology.
- topology/block - Allow for gaps in the block layout.
- Support for encrypting all network communication with TLS.
- jobcomp/kafka - Optionally send job info at job start as well as job end.
- Support an OR operator in --license requests.
- switch/hpe_slingshot - Support for > 252 ranks per node.
- switch/hpe_slingshot - Support mTLS authentication to the fabric manager.
- sacctmgr - Add support for dumping and loading QOSes.
- srun - Add new --wait-for-children option to keep the step running
until all launched processes have finished (cgroup/v2 only).
- slurmrestd - Add new endpoint for creating reservations.
This is the first release candidate of the upcoming 25.05 release
series; it represents the end of development for this release and a
finalization of the RPC and state file formats.
If any issues are identified with this release candidate, please report
them through https://bugs.schedmd.com against the 25.05.x version and we
will address them before the first production 25.05.0 release is made.
Please note that the release candidates are not intended for production use.
A preview of the updated documentation can be found at
https://slurm.schedmd.com/archive/slurm-master/ .
Slurm can be downloaded from https://www.schedmd.com/download-slurm/.
The changelog for 25.05.0rc1 can be found here:
https://github.com/SchedMD/slurm/blob/master/CHANGELOG/slurm-25.05.md#chang…
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
Hello,
we are running a SLURM-managed cluster with one control node (g-vm03)
and 26 worker nodes (ouga[03-28]) on Rocky 8. We recently updated from
20.11.9 through 23.02.8 to 24.11.0 and then 24.11.5. Since then, we are
experiencing performance issues - squeue and scontrol ping are slow to
react and sometimes deliver "timeout on send/recv" messages, even with
only very few parallel requests. We did not experience these issues with
SLURM 20.11.9, and we did not check the intermediate version 23.02.8
in detail. In the log of slurmctld, we also find messages like
slurmctld: error: slurm_send_node_msg: [socket:[1272743]]
slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing
socket error
We thus implemented all recommendations from the high throughput
documentation, and did achieve improvements with it (most notably by
increasing the maximum number of open files and increasing
MessageTimeout and TCPTimeout).
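For reference, a hedged sketch of what those knobs look like in slurm.conf
(the values here are placeholders for illustration, not the settings in use):
```
# Illustrative high-throughput-related settings (placeholder values);
# the open-files limit itself is raised via LimitNOFILE in the
# slurmctld systemd unit rather than in slurm.conf.
MessageTimeout=30
TCPTimeout=5
SchedulerParameters=defer,max_rpc_cnt=150
```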
For debugging, I attached the slurm.conf, the sdiag output (the server
thread count is almost always 1 and sometimes increases to 2), the
slurmctld log and the slurmdbd log from a time of high load.
We would be very thankful for any input on how to restore the old performance.
Kind Regards,
Tilman Hoffbauer
Dear Slurm-User List,
currently, in our slurm.conf, we are setting:
SelectType: select/cons_tres
SelectTypeParameters: CR_Core
and in our node configuration /RealMemory/ was basically reduced by an
amount to make sure the node always had enough RAM to run the OS.
However, this is apparently not how it is supposed to be done:
Lowering RealMemory with the goal of setting aside some amount for
the OS and not available for job allocations will not work as
intended if Memory is not set as a consumable resource in
*SelectTypeParameters*. So one of the *_Memory options need to be
enabled for that goal to be accomplished.
(https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory)
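As a hedged illustration of what that passage means in slurm.conf terms (node
names and values below are placeholders, not a recommendation):
```
# With memory as a consumable resource, a lowered RealMemory actually
# holds RAM back from job allocations instead of being ignored.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# e.g. a node with ~128 GB physical RAM advertised with some held back
NodeName=node[01-10] RealMemory=120000
```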
This leads to four questions regarding holding back RAM for worker
nodes. Answers/help with any of those questions would be appreciated.
*1.* Is reserving enough RAM for the worker node's OS and slurmd
actually a thing you have to manage?
*2.* If so how can we reserve enough RAM for the worker node's OS
and slurmd when using CR_Core?
*3.* Is that maybe a strong argument against using CR_Core that we
overlooked?
And semi-related:
https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory talks about
taking a value in megabytes.
*4.* Is RealMemory really expecting megabytes or is it mebibytes?
Best regards,
Xaver
Hi,
I am not sure what I have missed but I am getting this error on a compute node.
======
[root@vm01no16 ~]# systemctl status slurmd.service
× slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Sun 2025-05-11 23:01:29 UTC; 15min ago
Process: 1178 ExecStart=/usr/sbin/slurmd --systemd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 1178 (code=exited, status=1/FAILURE)
CPU: 3ms
May 11 23:01:29 vm01no16.ods.vuw.ac.nz systemd[1]: Starting Slurm node daemon...
May 11 23:01:29 vm01no16.ods.vuw.ac.nz slurmd[1178]: slurmd: fatal: Unable to determine this slurmd's NodeName
May 11 23:01:29 vm01no16.ods.vuw.ac.nz slurmd[1178]: fatal: Unable to determine this slurmd's NodeName
May 11 23:01:29 vm01no16.ods.vuw.ac.nz systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
May 11 23:01:29 vm01no16.ods.vuw.ac.nz systemd[1]: slurmd.service: Failed with result 'exit-code'.
May 11 23:01:29 vm01no16.ods.vuw.ac.nz systemd[1]: Failed to start Slurm node daemon.
[root@vm01no16 ~]# uname -a
Linux vm01no16.ods.vuw.ac.nz 5.14.0-503.40.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 30 17:38:54 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@vm01no16 ~]# vi /etc/hosts
[root@vm01no16 ~]# cat /etc/hostname
vm01no16.ods.vuw.ac.nz
[root@vm01no16 ~]# tail /etc/hosts
# Entry for int01no9.ods.vuw.ac.nz
130.195.87.29 int01no9.ods.vuw.ac.nz int01no9.ods.vuw.ac.nz-default
# Entry for vm01no13.ods.vuw.ac.nz
130.195.87.41 vm01no13.ods.vuw.ac.nz vm01no13.ods.vuw.ac.nz-default
# Entry for vm01no14.ods.vuw.ac.nz
130.195.87.42 vm01no14.ods.vuw.ac.nz vm01no14.ods.vuw.ac.nz-default
# Entry for vm01no15.ods.vuw.ac.nz
130.195.87.43 vm01no15.ods.vuw.ac.nz vm01no15.ods.vuw.ac.nz-default
# Entry for vm01no16.ods.vuw.ac.nz
130.195.87.44 vm01no16.ods.vuw.ac.nz vm01no16.ods.vuw.ac.nz-default
[root@vm01no16 ~]#
======
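A hedged sketch of the usual cause of that fatal: slurmd matches the host's
(short) hostname against the NodeName/NodeHostname entries in slurm.conf, so a
mismatch between the FQDN above and the configured node names can trigger it.
Two illustrative checks (the node name and hardware values are placeholders):
```
# What slurmd will try to match against slurm.conf:
hostname -s

# Either make an entry in slurm.conf match it, e.g.
#   NodeName=vm01no16 NodeHostname=vm01no16.ods.vuw.ac.nz CPUs=4 RealMemory=3800
# or tell slurmd its name explicitly when starting it by hand:
slurmd -N vm01no16
```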
regards
Steven Jones
B.Eng (Hons)
Technical Specialist - Linux RHCE
Victoria University, Digital Solutions,
Level 8 Rankin Brown Building,
Wellington, NZ
6012
0064 4 463 6272
Slurm versions 24.11.5, 24.05.8, and 23.11.11 are now available and
include a fix for a recently discovered security issue.
SchedMD customers were informed on April 23rd and provided a patch on
request; this process is documented in our security policy. [1]
A mistake with permission handling for Coordinators within Slurm's
accounting system can allow a Coordinator to promote a user to
Administrator. (CVE-2025-43904)
Thank you to Sekou Diakite (HPE) for reporting this.
Downloads are available at https://www.schedmd.com/downloads.php .
Release notes follow below.
- Tim
[1] https://www.schedmd.com/security-policy/
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
> * Changes in Slurm 24.11.5
> ==========================
> -- Return error to scontrol reboot on bad nodelists.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.40
> endpoints.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.41
> endpoints.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.42
> endpoints.
> -- data_parser/v0.0.42 - Added +inline_enums flag which modifies the
> output when generating OpenAPI specification. It causes enum arrays to not
> be defined in their own schema with references ($ref) to them. Instead they
> will be dumped inline.
> -- Fix binding error with tres-bind map/mask on partial node allocations.
> -- Fix stepmgr enabled steps being able to request features.
> -- Reject step creation if requested feature is not available in job.
> -- slurmd - Restrict listening for new incoming RPC requests further into
> startup.
> -- slurmd - Avoid auth/slurm related hangs of CLI commands during startup
> and shutdown.
> -- slurmctld - Restrict processing new incoming RPC requests further into
> startup. Stop processing requests sooner during shutdown.
> -- slurmctld - Avoid auth/slurm related hangs of CLI commands during
> startup and shutdown.
> -- slurmctld: Avoid race condition during shutdown or reconfigure that
> could result in a crash due to delayed processing of a connection while
> plugins are unloaded.
> -- Fix small memleak when getting the job list from the database.
> -- Fix incorrect printing of % escape characters when printing stdio
> fields for jobs.
> -- Fix padding parsing when printing stdio fields for jobs.
> -- Fix printing %A array job id when expanding patterns.
> -- Fix reservations causing jobs to be held for Bad Constraints
> -- switch/hpe_slingshot - Prevent potential segfault on failed curl
> request to the fabric manager.
> -- Fix printing incorrect array job id when expanding stdio file names.
> The %A will now be substituted by the correct value.
> -- switch/hpe_slingshot - Fix vni range not updating on slurmctld restart
> or reconfigure.
> -- Fix steps not being created when using certain combinations of -c and
> -n lower than the job's requested resources, when using stepmgr and nodes
> are configured with CPUs == Sockets*CoresPerSocket.
> -- Permit configuring the number of retry attempts to destroy CXI service
> via the new destroy_retries SwitchParameter.
> -- Do not reset memory.high and memory.swap.max in slurmd startup or
> reconfigure as we are never really touching this in slurmd.
> -- Fix reconfigure failure of slurmd when it has been started manually and
> the CoreSpecLimits have been removed from slurm.conf.
> -- Set or reset CoreSpec limits when slurmd is reconfigured and it was
> started with systemd.
> -- switch/hpe-slingshot - Make sure the slurmctld can free step VNIs after
> the controller restarts or reconfigures while the job is running.
> -- Fix backup slurmctld failure on 2nd takeover.
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
> * Changes in Slurm 24.05.8
> ==========================
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
> * Changes in Slurm 23.11.11
> ===========================
> -- Fixed a job requeuing issue that merged job entries into the same SLUID
> when all nodes in a job failed simultaneously.
> -- Add ABORT_ON_FATAL environment variable to capture a backtrace from any
> fatal() message.
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
Greetings,
We are new to Slurm and we are trying to better understand why we’re seeing
high-mem jobs stuck in Pending state indefinitely. Smaller (mem) jobs in
the queue will continue to pass by the high mem jobs even when we bump
priority on a pending high-mem job way up. We have been reading over the
backfill scheduling page and what we think we're seeing is that we need to
require that users specify a --time parameter on their jobs so that
Backfill works properly. None of our users specify a --time param because
we have never required it. Is that what we need to require in order to fix
this situation? From the backfill page: "Backfill scheduling is difficult
without reasonable time limit estimates for jobs, but some configuration
parameters that can help" and it goes on to list some config params that we
have not set (DefaultTime, MaxTime, OverTimeLimit). We also see language
such as, “Since the expected start time of pending jobs depends upon the
expected completion time of running jobs, reasonably accurate time limits
are important for backfill scheduling to work well.” So we suspect that we
can achieve proper backfill scheduling by requiring that all users supply a
"--time" parameter via a job submit plugin. Would that be a fair statement?
Thank you in advance!
-Mike Schor
Dear Slurm community,
I am confused by the behaviour of a freshly built openmpi-5.0.7 with
slurm-24.11.4. I can run a simple hello-world program via mpirun, but
with really slow startup (a single process needing 1.6 s, 384 processes
on two 192-core nodes need around half a minute).
I guess there is a deeper issue here to work out with openmpi itself
and why it needs 1.3 seconds to start a single process, even outside a
Slurm environment.
So I investigated whether running via srun changes things. Open-MPI docs
recommend using mpirun, and that has traditionally been our safe bet.
But srun direct launch is supposed to work, too. I did duly build
slurm-24.11.4 with
./configure --prefix=/syssw/slurm/24.11.4 \
--sysconfdir=/syssw/etc/slurm \
--with-munge=/syssw/munge/0.5.16 \
--with-hwloc=/syssw/hwloc/2.11.2 \
--disable-static --with-json \
--with-pmix=/syssw/pmix/3.2.5:/syssw/pmix/5.0.7 \
LDFLAGS=-Wl,--disable-new-dtags
providing the two versions of pmix on our system currently.
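As a hedged sanity check (the plugin names are what I would expect from the
configure line above), the build can be asked which MPI plugins it provides:
```
# Lists the MPI plugin types this Slurm build offers; both pmix_v3 and
# pmix_v5 should appear if the dual-PMIx build worked as intended.
srun --mpi=list
```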
Now I am perplexed to observe that
$ srun -vv --mpi=pmix_v5 -N 1 -n 1 mpihello
does _not_ work, but produces
$ srun -vv -n 1 -N 1 --mpi=pmix_v5 mpihello
srun: defined options
srun: -------------------- --------------------
srun: (null) : n[164-165]
srun: jobid : 671133
srun: job-name : interactive
srun: mpi : pmix_v5
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: jobid 671133: nodes(2):`n[164-165]', cpu counts: 192(x2)
srun: debug: requesting job 671133, user 99, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name mpihello, relative 65534
srun: CpuBindType=none
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v5: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:417: Abort agent port: 36505
srun: debug: mpi/pmix_v5: _pmix_abort_thread: (null) [0]: pmixp_agent.c:356: Start abort thread
srun: debug: mpi/pmix_v5: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:282: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 38681
srun: debug: Started IO server thread
srun: debug: Entering _launch_tasks
srun: launching StepId=671133.3 on host n165, 1 tasks: 0
srun: topology/tree: init: topology tree plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node n165, 1 tasks started
[n165:2287661] PMIX ERROR: PMIX_ERR_FILE_OPEN_FAILURE in file gds_shmem2.c at line 1056
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and MPI will try to terminate your MPI job as well)
[n165:2287661] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: Received task exit notification for 1 task of StepId=671133.3 (status=0x0e00).
srun: error: n165: task 0: Exited with exit code 14
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v5: _conn_readable: (null) [0]: pmixp_agent.c:109: false, shutdown
srun: debug: mpi/pmix_v5: _pmix_abort_thread: (null) [0]: pmixp_agent.c:363: Abort thread exit
Observe the line
[n165:2287661] PMIX ERROR: PMIX_ERR_FILE_OPEN_FAILURE in file gds_shmem2.c at line 1056
anyone got an idea what that means and what causes it?
And what really confuses me is that the test program _does_ work if I
switch from pmix_v5 to pmix_v3:
$ srun -vv -n 1 -N 1 --mpi=pmix_v3 mpihello
srun: defined options
srun: -------------------- --------------------
srun: (null) : n[164-165]
srun: jobid : 671133
srun: job-name : interactive
srun: mpi : pmix_v3
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: jobid 671133: nodes(2):`n[164-165]', cpu counts: 192(x2)
srun: debug: requesting job 671133, user 99, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name mpihello, relative 65534
srun: CpuBindType=none
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:417: Abort agent port: 43737
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:356: Start abort thread
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:282: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 43805
srun: debug: Started IO server thread
srun: debug: Entering _launch_tasks
srun: launching StepId=671133.4 on host n164, 1 tasks: 0
srun: topology/tree: init: topology tree plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node n164, 1 tasks started
hello world from processor n164, rank 0 out of 1
srun: Received task exit notification for 1 task of StepId=671133.4 (status=0x0000).
srun: n164: task 0: Completed
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:109: false, shutdown
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:363: Abort thread exit
How can a PMIx 5 MPI even work with the pmix_v3 plugin? Why does it
_not_ work with the pmix_v5 plugin? I am also curious why the plugins
don't link to the respective libpmix (are they using dlopen for their
dependencies? Why?).
$ ldd /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so:
linux-vdso.so.1 (0x00007ffd19ffb000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x000014d199e56000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000014d199c70000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000014d199b90000)
/lib64/ld-linux-x86-64.so.2 (0x000014d199edd000)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so:
linux-vdso.so.1 (0x00007ffd265f2000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x00001553902c8000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00001553900e2000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000155390002000)
/lib64/ld-linux-x86-64.so.2 (0x0000155390350000)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so:
linux-vdso.so.1 (0x00007ffd862b7000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x0000145adc36d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000145adc187000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000145adc0a7000)
/lib64/ld-linux-x86-64.so.2 (0x0000145adc3f4000)
But they do have the proper RPATH set up:
$ readelf -d /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so | grep -e ^File -e PATH
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so
0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so
0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/3.2.5/lib]
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so
0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]
Which is important, since libpmix doesn't get sensible SONAME
versioning (supposing they are supposed to be separate ABIs):
$ find /syssw/pmix/* -name 'libpmix.so*'
/syssw/pmix/3.2.5/lib/libpmix.so
/syssw/pmix/3.2.5/lib/libpmix.so.2.2.35
/syssw/pmix/3.2.5/lib/libpmix.so.2
/syssw/pmix/5.0.7/lib/libpmix.so
/syssw/pmix/5.0.7/lib/libpmix.so.2.13.7
/syssw/pmix/5.0.7/lib/libpmix.so.2
It's all libpmix.so.2. My mpihello program uses the 5.0.7 one, at least:
$ ldd mpihello
linux-vdso.so.1 (0x00007fff1f13f000)
libmpi.so.40 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libmpi.so.40 (0x000014ed89d95000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000014ed89baf000)
libopen-pal.so.80 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libopen-pal.so.80 (0x000014ed89a25000)
libfabric.so.1 => /syssw/fabric/1.21.0/lib/libfabric.so.1 (0x000014ed8989b000)
libefa.so.1 => /lib/x86_64-linux-gnu/libefa.so.1 (0x000014ed8988d000)
libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x000014ed8986c000)
libpsm2.so.2 => /syssw/psm2/12.0.1/lib/libpsm2.so.2 (0x000014ed89804000)
libatomic.so.1 => /sw/compiler/gcc-13.3.0/lib64/libatomic.so.1 (0x000014ed897fb000)
libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x000014ed8976a000)
libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x000014ed89745000)
libpmix.so.2 => /syssw/pmix/5.0.7/lib/libpmix.so.2 (0x000014ed8951e000)
libevent_core-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x000014ed894e8000)
libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x000014ed894e3000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x000014ed89486000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000014ed893a6000)
/lib64/ld-linux-x86-64.so.2 (0x000014ed8a0d8000)
libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x000014ed89397000)
Can someone shed light on how the differing PMIx plugins are supposed
to work? Can someone share a setup where pmix_v5 does work with openmpi
5.x?
Alrighty then,
Thomas
--
Dr. Thomas Orgis
HPC @ Universität Hamburg
Hi folks
I'm in the process of standing up some Stampede2 Dell C3620p KNL nodes and I seem to be hitting a blind spot.
I previously "successfully" configured KNL's on an Intel board (S7200AP), with OpenHPC and Rocky8. I say "successfully" because it works but evidently my latest troubleshooting has revealed that I may have been lucky rather than an expert KNL integrator :-)
I thought I knew what I was doing, but after repeating my Intel KNL recipe with the Dell system, I have unearthed my ignorance with this wonderful (but deprecated) technology (anecdotally, the KNLs offer excellent performance and power efficiency for our workloads, particularly when contrasted with our alternative available hardware).
The first discovery was that the "syscfg" for Intel boards is not the same as the "syscfg" for Dell boards. I've since sorted this out.
The second discovery was made while troubleshooting an issue that I'm hitting. After realising that the slurmd client nodes don't seem to be reading the "knl_generic.conf" parameters that are specified in /etc/slurm on the smshost (OpenHPC parlance for head node; and it's a Slurm configless setup), I think my original Intel solution was working out of luck more than ingenuity.
For reference, the Slurm configuration for KNL now includes:
```
NodeFeaturesPlugins=knl_generic
DebugFlags=NodeFeatures
GresTypes=hbm
```
And I've created a separate "knl_generic.conf" that points to the Dell specific tools and features.
For the Dell system, slurmd seems to ignore my knl_generic.conf file and is drawing defaults from somewhere else. Slurm still considers SystemType to be Intel, SyscfgPath to be the default location, and SyscfgTimeout to be 1000. For Dell systems, Slurm needs SystemType=Dell and a larger SyscfgTimeout (10000 is the documented recommendation).
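For reference, a hedged sketch of the sort of knl_generic.conf contents in
question (the syscfg path is a placeholder; the option names come from the
knl.conf man page):
```
# Placeholder path to the Dell syscfg tool; adjust to the actual location
SystemType=Dell
SyscfgPath=/opt/dell/syscfg/syscfg
SyscfgTimeout=10000
```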
I don't understand why the nodes are not reading the knl_generic file - any help or clues would be appreciated.
Here's my theory on what is happening:
The Intel KNL system was successful by luck: it probably exhibited the same ignore-the-config-file behaviour but ran with default NodeFeatures settings for some generic knl_generic parameters that are stored somewhere as defaults. I must have just lucked out with my Intel KNL system because it was using the defaults (which are compatible with Intel).
If this assumption is correct, the Dell system is not working because it isn't compatible with the Intel defaults.
Any clues on how to successfully invoke the config file (or better debugging techniques to figure out why it isn't being read) would be appreciated.
I can share journalctl feedback if necessary. For now, I've tried changing ownership of the config files to root:slurm, copied knl_generic.conf to the compute nodes' /etc/slurm/ and also tried to specify the config file by running (on the compute nodes) "slurmd" with "-f" ... No joy; if slurmd runs successfully (when I don't screw up some random experimental settings) then it always seems to ignore knl_generic.conf and loads some default settings from somewhere.
A few questions:
1. Are there default settings stored somewhere? I might be barking up the wrong tree, although I've looked for files that may clash with the config file I've created but can't find any.
2. Is there a better way to force the knl_generic file to be loaded?
3. Is the configless Slurm somehow not distributing the knl_generic file to the clients? I understand that all configuration files are read from the host server.
Many thanks for any help!
Regards / Groete / Sala(ni) Kahle
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Bryan Johnston
Senior HPC Technologist II
Lead: HPC Ecosystems Project
HPCwire 2024's Outstanding Leader in HPC
CHPC | www.chpc.ac.za | NICIS | nicis.ac.za
Centre for High Performance Computing
If you receive an email from me out of office hours for you, please do not feel obliged to respond during off-hours!
Book time to meet with me<https://outlook.office.com/bookwithme/user/87af4906a703488386578f34e4473c74…>
Does anyone have any experience with using Kerberos/GSSAPI and Slurm? I’m specifically wondering if there is a known mechanism for providing proper Kerberos credentials to Slurm batch jobs, such that those processes would be able to access a filesystem that requires Kerberos credentials. Some quick searching returned nothing useful. Interactive jobs have a similar problem, but I’m hoping that SSH credential forwarding can be leveraged there.
I’m nothing like an expert in Kerberos, so forgive any apparent ignorance.
Thanks,
John