As we've scaled up our Slurm usage, we've noticed that short bursts of even a moderate number of DNS lookup failures are enough to regularly stall slurmctld:
_xgetaddrinfo: getaddrinfo(runner-t35knco5d-project-54-concurrent-0:37251) failed
...this has a cascading effect where, while stalled, the controller can't always communicate with nodes:
error: Error connecting, bad data: family = 0, port = 0
...and the controller will immediately mark the nodes as unhealthy and kill jobs:
slurmctld: Killing JobId=3120751 on failed node slurm-0f6cacdc1
The reason for the DNS failures is not an unreliable DNS server or network, but rather that the jobs are submitted by containers that don't have resolvable hostnames. This traditionally hasn't disrupted functionality, but we've noticed that if 8-10 jobs all terminate at the same time (the submitter container SIGTERMs the srun process), the controller can easily be overloaded for several seconds, despite having significant free system resources. gdb confirms the process is hanging on DNS. We can also see "Socket timed out on send/recv operation" from clients attempting to interact with the controller during the issue.
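For reference, this is roughly how we caught it in the act (command shown for illustration); attaching during a stall shows threads sitting in getaddrinfo():

    # attach to the running controller during a stall and dump all thread backtraces
    gdb -batch -ex 'thread apply all bt' -p $(pidof slurmctld)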
slurm 24.11.0
RHEL 8.10 kernel 4.18.0-553.58.1.el8_10.x86_64
We're looking into ways to make our ephemeral job-submitter containers resolvable in DNS to prevent lookup failures (either by giving them resolvable hostnames, or by blackholing the records to 0.0.0.0 to allow for fast local failure on the slurmctld server). However, it does seem unusual for a handful of bad DNS lookups to cause so much disruption in slurmctld.
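The blackhole variant would just be static entries on the slurmctld host, something like the following (name copied from the log above; this assumes the runner hostnames can be enumerated or pre-generated, which we haven't confirmed yet):

    # /etc/hosts on the slurmctld host: resolve known submitter names locally to
    # 0.0.0.0 so getaddrinfo() returns immediately instead of waiting on DNS
    0.0.0.0   runner-t35knco5d-project-54-concurrent-0   # one line per runner hostname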
Is this a known weak point of ctld? The slurmctld host is a single-purpose 16 vCPU / 30 GB EC2 instance with minimal load. We have ~150 nodes, and all nodes have valid IPs in slurm.conf to remove the need for ctld to perform lookups for nodes, but apparently there is still a need to look up the submit host as well, and we can reliably reproduce these cascading failures.
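For context, the node entries already look roughly like this (the address is a placeholder, not our real value), which is why we expected ctld to have no reason to touch DNS at all:

    # slurm.conf: NodeAddr pins each node to an IP so slurmctld never resolves
    # node hostnames (address shown is illustrative)
    NodeName=slurm-0f6cacdc1 NodeAddr=10.0.12.34 State=UNKNOWN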
Another possibility might be to extend SlurmdTimeout to something very long and hope that the controller recovers from its stall quickly enough to prevent it from marking nodes as unhealthy and killing jobs, but it's not clear whether that will have any effect, since the first occurrence of "error: Error connecting, bad data: family = 0, port = 0" immediately drains nodes and kills jobs.
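Concretely, that would just be bumping the timeout in slurm.conf, e.g. (the value is a guess at "long enough", not something we've validated):

    # slurm.conf: raise SlurmdTimeout well above the longest observed stall so a
    # briefly-stalled controller doesn't mark nodes down (value illustrative)
    SlurmdTimeout=600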
Thanks