Hello,
We are trying to run some PIConGPU simulations on a machine with 8x NVIDIA H100 GPUs, using Slurm. The jobs never actually run: they start, are killed less than a minute later, and are then no longer in the queue. For context, our batch script looks roughly like the sketch below; the job name, paths, and exact srun line are approximations on my part, but the resource requests match what slurmd reports for job 1079.
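#!/bin/bash
# Rough sketch of our submission script -- job name, binary path, and
# srun arguments are approximations; the resource requests match what
# slurmd logs for job 1079 below.
#SBATCH --job-name=picongpu      # approximate
#SBATCH --nodes=1                # matches node_cnt:1
#SBATCH --ntasks=8               # one rank per GPU; matches "laying out the 8 tasks"
#SBATCH --cpus-per-task=8        # 64 cores total; matches abstract cores '0-63'
#SBATCH --gres=gpu:8             # matches total_gres:8
#SBATCH --mem=648000M            # matches job_mem_limit=648000

srun ./bin/picongpu ...          # actual PIConGPU command line elided

In the slurmd log on the compute node I have: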
[2024-10-24T09:50:40.934] CPU_BIND: _set_batch_job_limits: Memory extracted from credential for StepId=1079.batch job_mem_limit= 648000
[2024-10-24T09:50:40.934] Launching batch job 1079 for UID 1009
[2024-10-24T09:50:40.938] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2024-10-24T09:50:40.939] debug: gres/gpu: init: loaded
[2024-10-24T09:50:41.022] [1079.batch] debug: cgroup/v2: init: Cgroup v2 plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] debug: CPUs:192 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:2
[2024-10-24T09:50:41.026] [1079.batch] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] CPU_BIND: Memory extracted from credential for StepId=1079.batch job_mem_limit=648000 step_mem_limit=648000
[2024-10-24T09:50:41.027] [1079.batch] debug: laying out the 8 tasks on 1 hosts mihaigpu2 dist 2
[2024-10-24T09:50:41.027] [1079.batch] gres_job_state gres:gpu(7696487) type:(null)(0) job:1079 flags:
[2024-10-24T09:50:41.027] [1079.batch] total_gres:8
[2024-10-24T09:50:41.027] [1079.batch] node_cnt:1
[2024-10-24T09:50:41.027] [1079.batch] gres_cnt_node_alloc[0]:8
[2024-10-24T09:50:41.027] [1079.batch] gres_bit_alloc[0]:0-7 of 8
[2024-10-24T09:50:41.027] [1079.batch] debug: Message thread started pid = 459054
[2024-10-24T09:50:41.027] [1079.batch] debug: Setting slurmstepd(459054) oom_score_adj to -1000
[2024-10-24T09:50:41.027] [1079.batch] debug: switch/none: init: switch NONE plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: core enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:2063720M allowed:100%(enforced), swap:0%(permissive), max:100%(2063720M) max+swap:100%(4127440M) min:30M kmem:100%(2063720M permissive) min:30M
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: memory enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] cred/munge: init: Munge credential signature plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug: job_container/none: init: job_container none plugin loaded
[2024-10-24T09:50:41.030] [1079.batch] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: job: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: step: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.064] [1079.batch] debug levels are stderr='error', logfile='debug', syslog='quiet'
[2024-10-24T09:50:41.064] [1079.batch] starting 1 tasks
[2024-10-24T09:50:41.064] [1079.batch] task 0 (459058) started 2024-10-24T09:50:41
[2024-10-24T09:50:41.069] [1079.batch] _set_limit: RLIMIT_NOFILE : reducing req:1048576 to max:131072
[2024-10-24T09:51:23.066] debug: _rpc_terminate_job: uid = 64030 JobId=1079
[2024-10-24T09:51:23.067] debug: credential for job 1079 revoked
[2024-10-24T09:51:23.067] [1079.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.067] [1079.batch] debug: _handle_signal_container for StepId=1079.batch uid=64030 signal=18
[2024-10-24T09:51:23.068] [1079.batch] Sent signal 18 to StepId=1079.batch
[2024-10-24T09:51:23.068] [1079.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.068] [1079.batch] debug: _handle_signal_container for StepId=1079.batch uid=64030 signal=15
[2024-10-24T09:51:23.068] [1079.batch] error: *** JOB 1079 ON mihaigpu2 CANCELLED AT 2024-10-24T09:51:23 ***
[2024-10-24T09:51:23.069] [1079.batch] Sent signal 15 to StepId=1079.batch
[2024-10-24T09:51:23.069] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.071] [1079.batch] task 0 (459058) exited. Killed by signal 15.
[2024-10-24T09:51:23.090] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.141] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.241] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.741] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:24.073] [1079.batch] debug: signaling condition
[2024-10-24T09:51:24.073] [1079.batch] debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2024-10-24T09:51:24.073] [1079.batch] debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2024-10-24T09:51:24.073] [1079.batch] debug: get_exit_code task 0 killed by cmd
[2024-10-24T09:51:24.073] [1079.batch] job 1079 completed with slurm_rc = 0, job_rc = 15
[2024-10-24T09:51:24.075] [1079.batch] debug: Message thread exited
[2024-10-24T09:51:24.154] [1079.batch] done with job
From the log it looks like the job starts fine, but about 40 seconds later slurmd receives _rpc_terminate_job from uid 64030 (which I assume is our SlurmUser, i.e. the controller itself revoked the credential and cancelled the job), and task 0 is killed with signal 15. I cannot find anything in the slurmd log that says why the job was cancelled. Does anyone have an idea what the problem could be?
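If more output would help, I can run and post things like the following (the slurmctld log path is an assumption; ours may differ):

sacct -j 1079 --format=JobID,JobName,State,ExitCode,DerivedExitCode,Elapsed,Timelimit
scontrol show job 1079        # only while the record is still in slurmctld's memory
grep 1079 /var/log/slurm/slurmctld.log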
Thank you,
Mihai