- slurm-users - lists.schedmd.com

Can SLURM queue different jobs to start concurrently?
by Dan Healy 08 Jul '24

08 Jul '24

Hi there, I've received a question from an end user, which I presume the answer is "No", but would like to ask the community first. Scenario: The user wants to create a series of jobs that all need to start at the same time. Example: there are 10 different executable applications which have varying CPU and RAM constraints, all of which need to communicate via TCP/IP. Of course the user could design some type of idle/statusing mechanism to wait until all jobs are *randomly *started, then begin execution, but this feels like a waste of resources. The complete execution of these 10 applications would be considered a single simulation. The goal would be to distribute these 10 applications across the cluster and not necessarily require them all to execute on a single node. Is there a good architecture for this using SLURM? If so, please kindly point me in the right direction. -- Thanks, Daniel Healy

3 2

Re: Can SLURM queue different jobs to start concurrently?
by Lloyd Brown 08 Jul '24

08 Jul '24

I'm confused. Why can't they just use a multi-node job, and have the job script farm out the individual tasks to the various workers through some mechanism (srun, mpirun, ssh, etc.)? AFAIK, there's nothing preventing a job from using resources on multiple hosts. The job just needs to have some way of pushing the work out to those hosts. Lloyd On 7/8/24 14:17, Dan Healy via slurm-users wrote: > Hi there, > > I've received a question from an end user, which I presume the answer > is "No", but would like to ask the community first. > > Scenario: The user wants to create a series of jobs that all need to > start at the same time. Example: there are 10 different executable > applications which have varying CPU and RAM constraints, all of which > need to communicate via TCP/IP. Of course the user could design some > type of idle/statusing mechanism to wait until all jobs are /randomly > /started, then begin execution, but this feels like a waste of > resources. The complete execution of these 10 applications would be > considered a single simulation. The goal would be to distribute these > 10 applications across the cluster and not necessarily require them > all to execute on a single node. > > Is there a good architecture for this using SLURM? If so, please > kindly point me in the right direction. > > -- > Thanks, > > Daniel Healy -- Lloyd Brown HPC Systems Administrator Office of Research Computing Brigham Young University http://rc.byu.edu

1 0

cgroups/v2 plugin rpmbuild issue
by Chris Taylor 06 Jul '24

06 Jul '24

Trying to use rpmbuild on Rocky9 Linux, Slurm 21.08 - I want to build with cgroups/v2 support and have these installed: libbpf-devel.x86_64 2:1.3.0-2.el9 dbus-devel.x86_64 1:1.12.20-8.el9 kernel-headers.x86_64 5.14.0-427.22.1.el9_4 hwloc-devel.x86_64 2.4.1-5.el9 In https://slurm.schedmd.com/cgroup_v2.html it says: Requirements For building cgroup/v2 there are two required libraries checked at configure time... Look at your config.log when configuring to see if they were correctly detected on your system. I don't see any mention of 'ebpf', 'bpf', or 'dbus' when I grep through ~/rpmbuild/BUILD/slurm-21.08.8-2/config.log. Before I noticed that, I installed the RPMS on my cluster and get cgroup errors when I try to run a job or get an allocation. The slurmctld and slurmd processes start up fine on the servers, and I have a generic /etc/slurm/cgroup.conf. It seems like the cgroups plugin isn't getting built right with rpmbuild- how do I troubleshoot? Thanks Chris

2 3

Qlustar HPC Core Stack 24.06
by Roland Fehrenbacher 06 Jul '24

06 Jul '24

Hi all, the Qlustar HPC Core Stack 24.06 update is available. It includes an update to Slurm 23.11.8. Find further details at https://qlustar.com/news/2024/2024-06-29-hpc-core-stack-2406 Enjoy, Roland ------- https://qlustar.com -- 100% Open Source HPC / AI / Storage / Cloud Linux Cluster OS --

1 0

Run a program via Strigger when a node joins the cluster
by Karri Vrkreddy 06 Jul '24

06 Jul '24

Hi, We have a requirement to run a specific program whenever any new node joins the slurm cluster. For this, we have tried using strigger with the following options : "strigger --set --node --up --flags=perm --program=<program>" We see that the trigger is not getting activated always. Checked slurmctld logs for "type=node:up" and noticed it is not always happening. Based on documentation at https://slurm.schedmd.com/strigger.html , "up" flag would mean Trigger an event if the specified node is returned to service from a DOWN state 1. From Slurm Workload Manager - Slurm Power Saving Guide , and the above message does it mean that a node has to be already in DOWN state in the cluster for a trigger with "--up" to get activated ? 2. Are there any other ways beyond strigger to achieve this ? | | | | | | | | | | | Slurm Workload Manager - Slurm Power Saving Guide | | | -- Thanks Karri

1 0

Using sharding
by Ricardo Cruz 05 Jul '24

05 Jul '24

Greetings, There are not many questions regarding GPU sharding here, and I am unsure if I am using it correctly... I have configured it according to the instructions <https://slurm.schedmd.com/gres.html>, and it seems to be configured properly: $ scontrol show node compute01 NodeName=compute01 Arch=x86_64 CoresPerSocket=32 CPUAlloc=48 CPUEfctv=128 CPUTot=128 CPULoad=10.95 AvailableFeatures=(null) ActiveFeatures=(null) * Gres=gpu:8,shard:32* [truncated] When running with gres:gpu everything works perfectly: $ /usr/bin/srun --gres=gpu:2 ls srun: job 192 queued and waiting for resources srun: job 192 has been allocated resources (...) However, when using sharding, it just stays waiting indefinitely: $ /usr/bin/srun --gres=shard:2 ls srun: job 193 queued and waiting for resources The reason it gives for pending is just "Resources": $ scontrol show job 193 JobId=193 JobName=ls UserId=rpcruz(1000) GroupId=rpcruz(1000) MCS_label=N/A Priority=1 Nice=0 Account=account QOS=normal * JobState=PENDING Reason=Resources Dependency=(null)* Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A SubmitTime=2024-06-28T05:36:51 EligibleTime=2024-06-28T05:36:51 AccrueTime=2024-06-28T05:36:51 StartTime=2024-06-29T18:13:22 EndTime=2024-07-01T18:13:22 Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-06-28T05:37:20 Scheduler=Backfill:* Partition=partition AllocNode:Sid=localhost:47757 ReqNodeList=(null) ExcNodeList=(null) NodeList= NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:* ReqTRES=cpu=1,mem=1031887M,node=1,billing=1 AllocTRES=(null) Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=ls WorkDir=/home/rpcruz Power= * TresPerNode=gres/shard:2* Again, I think I have configured it properly - it shows up correctly in scontrol (as shown above). Our setup is pretty simple - I just added shard to /etc/slurm/slurm.conf: GresTypes=gpu,shard NodeName=compute01 Gres=gpu:8,shard:32 [truncated] Our /etc/slurm/gres.conf is also straight-forward: (it works fine for --gres=gpu:1) Name=gpu File=/dev/nvidia[0-7] Name=shard Count=32 Maybe I am just running srun improperly? Shouldn't it just be srun --gres= shard:2 to allocate half of a GPU? (since I am using 32 shards for the 8 gpus, so it's 4 shards per gpu) Thank you very much for your attention, -- Ricardo Cruz - https://rpmcruz.github.io

5 7

Re: Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm, SINGULARITYENV_SLURM_CONF
by Robert Kudyba 03 Jul '24

03 Jul '24

In https://support.schedmd.com/show_bug.cgi?id=9282#c6 Tim mentioned this env variable SINGULARITYENV_SLURM_CONF, what is the usage/syntax for it? I can't find any reference to this. I'm running into the same issue mentioned there. Thanks in advance!

2 2

Slurm version 24.05.1 is now available
by Tim Wickberg 27 Jun '24

27 Jun '24

We are pleased to announce the availability of Slurm version 24.05.1. This release addresses a number of minor-to-moderate issues since the 24.05 release was first announced a month ago. Slurm can be downloaded from https://www.schedmd.com/downloads.php . - Tim > * Changes in Slurm 24.05.1 > ========================== > -- Fix slurmctld and slurmdbd potentially stopping instead of performing a > logrotate when recieving SIGUSR2 when using auth/slurm. > -- switch/hpe_slingshot - Fix slurmctld crash when upgrading from 23.02. > -- Fix "Could not find group" errors from validate_group() when using > AllowGroups with large /etc/group files. > -- Prevent an assertion in debugging builds when triggering log rotation > in a backup slurmctld. > -- Add AccountingStoreFlags=no_stdio which allows to not record the stdio > paths of the job when set. > -- slurmrestd - Prevent a slurmrestd segfault when parsing the crontab field, > which was never usable. Now it explicitly ignores the value and emits a > warning if it is used for the following endpoints: > 'POST /slurm/v0.0.39/job/{job_id}' > 'POST /slurm/v0.0.39/job/submit' > 'POST /slurm/v0.0.40/job/{job_id}' > 'POST /slurm/v0.0.40/job/submit' > 'POST /slurm/v0.0.41/job/{job_id}' > 'POST /slurm/v0.0.41/job/submit' > 'POST /slurm/v0.0.41/job/allocate' > -- mpi/pmi2 - Fix communication issue leading to task launch failure with > "invalid kvs seq from node". > -- Fix getting user environment when using sbatch with "--get-user-env" or > "--export=" when there is a user profile script that reads /proc. > -- Prevent slurmd from crashing if acct_gather_energy/gpu is configured but > GresTypes is not configured. > -- Do not log the following errors when AcctGatherEnergyType plugins are used > but a node does not have or cannot find sensors: > "error: _get_joules_task: can't get info from slurmd" > "error: slurm_get_node_energy: Zero Bytes were transmitted or received" > However, the following error will continue to be logged: > "error: Can't get energy data. No power sensors are available. Try later" > -- sbatch, srun - Set SLURM_NETWORK environment variable if --network is set. > -- Fix cloud nodes not being able to forward to nodes that restarted with new > IP addresses. > -- Fix cwd not being set correctly when running a SPANK plugin with a > spank_user_init() hook and the new "contain_spank" option set. > -- slurmctld - Avoid deadlock during shutdown when auth/slurm is active. > -- Fix segfault in slurmctld with topology/block. > -- sacct - Fix printing of job group for job steps. > -- scrun - Log when an invalid environment variable causes the job submission > to be rejected. > -- accounting_storage/mysql - Fix problem where listing or modifying an > association when specifying a qos list could hang or take a very long time. > -- gpu/nvml - Fix gpuutil/gpumem only tracking last GPU in step. Now, > gpuutil/gpumem will record sums of all GPUS in the step. > -- Fix error in scrontab jobs when using slurm.conf:PropagatePrioProcess=1. > -- Fix slurmctld crash on a batch job submission with "--nodes 0,...". > -- Fix dynamic IP address fanout forwarding when using auth/slurm. > -- Restrict listening sockets in the mpi/pmix plugin and sattach to the > SrunPortRange. > -- slurmrestd - Limit mime types returned from query to 'GET /openapi/v3' to > only return one mime type per serializer plugin to fix issues with OpenAPI > client generators that are unable to handle multiple mime type aliases. > -- Fix many commands possibly reporting an "Unexpected Message Received" when > in reality the connection timed out. > -- Prevent slurmctld from starting if there is not a json serializer present > and the extra_constraints feature is enabled. > -- Fix heterogeneous job components not being signaled with scancel --ctld and > 'DELETE slurm/v0.0.40/jobs' if the job ids are not explicitly given, > the heterogeneous job components match the given filters, and the > heterogeneous job leader does not match the given filters. > -- Fix regression from 23.02 impeding job licenses from being cleared. > -- Move error to log_flag which made _get_joules_task error to be logged to the > user when too many rpcs were queued in slurmd for gathering energy. > -- For scancel --ctld and the associated rest api endpoints: > 'DELETE /slurm/v0.0.40/jobs' > 'DELETE /slurm/v0.0.41/jobs' > Fix canceling the final array task in a job array when the task is pending > and all array tasks have been split into separate job records. Previously > this task was not canceled. > -- Fix power_save operation after recovering from a failed reconfigure. > -- slurmctld - Skip removing the pidfile when running under systemd. In that > situation it is never created in the first place. > -- Fix issue where altering the flags on a Slurm account (UsersAreCoords) > several limits on the account's association would be set to 0 in > Slurm's internal cache. > -- Fix memory leak in the controller when relaying stepmgr step accounting to > the dbd. > -- Fix segfault when submitting stepmgr jobs within an existing allocation. > -- Added "disable_slurm_hydra_bootstrap" as a possible MpiParams parameter in > slurm.conf. Using this will disable env variable injection to allocations > for the following variables: I_MPI_HYDRA_BOOTSTRAP, > I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS, HYDRA_BOOTSTRAP, > HYDRA_LAUNCHER_EXTRA_ARGS. > -- scrun - Delay shutdown until after start requested. This caused scrun > to never start or shutdown and hung forever when using --tty. > -- Fix backup slurmctld potentially not running the agent when taking over as > the primary controller. > -- Fix primary controller not running the agent when a reconfigure of the > slurmctld fails. > -- slurmd - fix premature timeout waiting for REQUEST_LAUNCH_PROLOG with large > array jobs causing node to drain. > -- jobcomp/{elasticsearch,kafka} - Avoid sending fields with invalid date/time. > -- jobcomp/elasticsearch - Fix slurmctld memory leak from curl usage > -- acct_gather_profile/influxdb - Fix slurmstepd memory leak from curl usage > -- Fix 24.05.0 regression not deleting job hash dirs after MinJobAge. > -- Fix filtering arguments being ignored when using squeue --json. > -- switch/nvidia_imex - Move setup call after spank_init() to allow namespace > manipulation within the SPANK plugin. > -- switch/nvidia_imex - Skip plugin operation if nvidia-caps-imex-channels > device is not present rather than preventing slurmd from starting. > -- switch/nvidia_imex - Skip plugin operation if job_container/tmpfs > is configured due to incompatibility. > -- switch/nvidia_imex - Remove any pre-existing channels when slurmd starts. > -- rpc_queue - Add support for an optional rpc_queue.yaml configuration file.

1 0

SLURM QoS Preemption Not Functioning as Expected
by 木子 27 Jun '24

27 Jun '24

Hi all; I'm testing slurm preempt with PreemptType=preempt/qos and PreemptMode=suspend,gang in slurm.conf. But job with high-qos not preempt the job with low-qos. My configuration followed: slurm version: 23.11.4 test node : 8 CPUS and 8GB RAM QoS Configuration:  Partition Configuration:  Job Submission: I submitted test jobs, one with --qos=low-qos and--partition=low , resources 4 CPUs, 4GB memory, and later, a  job  with --qos=high-qos and  --partition=g1 requesting more resources( 5 CPUs, 5GB memory). Did anyone else encounter this issue or any additional steps I might need to take to correctly enable QoS-based preemption in our SLURM setup. Best regards, Liyar

1 0

scontrol update job=... comment=... fails with "Job is no longer pending"
by Thomas Zeiser 27 Jun '24

27 Jun '24

Dear Slurm community, we are running 23.02.7. For a small fraction of jobs we get "Job is no longer pending execution for job <JOBID>" when running "scontrol update job=JOBID comment=SOME-COMMENT" on an already running job. For the majority (I guess 98%) of running jobs, this works perfectly fine. I also see no reason why updating the comment of a running job should not be allowed. Did anyone else observe this issue or are there explanations why this might be expected? (We abuse the comment field to track which GPUs within a node are assigned to a job as we did not find any option within Slurm to get the GPU IDs of assigned GPUs but we need that information for our monitoring framework.) Best regards, Thomas Zeiser, NHR@FAU

1 0

2025

2024

slurm-users ----- 2025 ----- July 2025 June 2025 May 2025 April 2025 March 2025 February 2025 January 2025 ----- 2024 ----- December 2024 November 2024 October 2024 September 2024 August 2024 July 2024 June 2024 May 2024 April 2024 March 2024 February 2024 January 2024

slurm-users