- slurm-users - lists.schedmd.com

Incorrect hyperthreading with Slurm 23.11
by Guillaume COCHARD 28 Mar '24

28 Mar '24

Hello, We have upgraded our cluster to Slurm 23.11.1 then, a few weeks later, to 23.11.4. Since then, Slurm doesn't detect hyperthreaded CPUs. We have downgraded our test cluster, the issue is not present with Slurm 22.05 (we had skipped Slurm 23.02). For example, we are working with this node: $ slurmd -C NodeName=node03 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128215 It is defined like this in slurm.conf: SelectTypeParameters=CR_CPU_Memory … [View More]TaskPlugin=task/cgroup,task/affinity NodeName=node03 CPUs=40 RealMemory=150000 Feature=htc MemSpecLimit=5000 NodeSet=htc Feature=htc PartitionName=htc Default=YES MinNodes=0 MaxNodes=1 Nodes=htc DefMemPerCPU=1000 State=UP LLN=Yes MaxMemPerNode=142000 So no oversubscribing, 20 cores and 40 CPUs thanks to hyperthreading. Until the updgrade, Slurm was allocating those 40 CPUs: when launching 40 jobs of 1 CPU, each of those job would use one 1 CPU. This is the expected behavior. Since the upgrade, we can still launch those 40 jobs, but only the first half of the CPUs will be used (CPUs 0 to 19 according to htop). Each of those CPUs is used by 2 jobs, and the second half of the CPUs (#20 to 39) stay completely idle. When launching 40 stress processes directly in the node without using Slurm all the CPUs are used. When allocating a specific CPU with srun, it works until CPU #19 and then an error occurs even if the allocation includes all the CPUs of the node: #SBATCH --ntasks=1 #SBATCH --cpus-per-task=40 # Works for 0 to 19 srun --cpu-bind=v,map_cpu:19 stress.py # Doesn't work (20 to 39) srun --cpu-bind=v,map_cpu:20 stress.py # Output: srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000FFFFF. srun: error: Task launch for StepId=57194.0 failed on node node03: Unable to satisfy cpu bind request srun: error: Application launch failed: Unable to satisfy cpu bind request srun: Job step aborted This behaviour concerns all our nodes, some of which have been restarted recently and others have not. This causes the jobs to be frequently interrupted, augmenting the difference between the system real time and user+system times and making the jobs slower. We have been peering the documentation but, from what we understand, our configuration seems correct. In particular, as advised by the documentation[1], we don't set up ThreadsPerCore in slurm.conf. Are we missing something, or is there a regression or a change in Slurm configuration since the version 23.11? Thank you, Guillaume [1] : https://slurm.schedmd.com/slurm.conf.html#OPT_ThreadsPerCore [View Less]

2 1

FairShare priority questions
by Long, Daniel S. 27 Mar '24

27 Mar '24

Hi, I’m trying to set up multifactor priority on our cluster and am having some trouble getting it to behave the way I’d like. My main issues seem to revolve around FairShare. We have multiple projects on our cluster and multiple users in those projects (and some users are in multiple projects, of course). I would like the FairShare to be based only on the project associated with the job; if user A and user B both submit jobs on project C, the FairShare should be identical. However, it looks … [View More]

3 2

Slurm releases move to a six-month cycle
by Tim Wickberg 26 Mar '24

26 Mar '24

Slurm major releases are moving to a six month release cycle. This change starts with the upcoming Slurm 24.05 release this May. Slurm 24.11 will follow in November 2024. Major releases then continue every May and November in 2025 and beyond. There are two main goals of this change: - Faster delivery of newer features and functionality for customers. - "Predictable" release timing, especially for those sites that would prefer to upgrade during an annual system maintenance window. SchedMD … [View More]

1 0

controller backup slurmctld error while takeover
by Miriam Olmi 26 Mar '24

26 Mar '24

Dear all, I am having trouble finalizing the configuration of the backup controller for my slurm cluster. In principle, if no job is running everything seems fine: both the slurmctld services on the primary and the backup controller are running and if I stop the service on the primary controller after 10s more or less (SlurmctldTimeout = 10 sec) the backup controller takes over. Also, if I run the sinfo or squeue command during the 10s of inactivity, the shell stay pending but it … [View More]

2 5

cgroups_exporter for slurm on rhel9 (cgroups-v2)
by Saluja, Prabhjyot 25 Mar '24

25 Mar '24

Hi All, We are currently trying to set up cgroup_exporter <https://github.com/treydock/cgroup_exporter> for slurm. It's been working smoothly with cgroups.v1 and slurm-22.05.7. However, we're facing some challenges with RHEL-9, slurm-23.11.1 and cgroups.v2. The cgroup_exporter isn't capturing the slurm cgroup job information. I'm reaching out to see if any other sites have managed to make this work. If you're using a different exporter that's working for your site, could you please let us … [View More]

2 1

Slurm version 23.11.5 is now available
by Tim McMullan 21 Mar '24

21 Mar '24

We are pleased to announce the availability of Slurm version 23.11.5. The 23.11.5 release includes some important fixes related to newer features as well as some database fixes. The most noteworthy fixes include fixing the sattach command (which only worked for root and SlurmUser after 23.11.0) and fixing an issue while constructing the new lineage database entries. This last change will also perform a query during the upgrade from any prior 23.11 version to fix existing databases. … [View More]Slurm can be downloaded from https://www.schedmd.com/downloads.php . -Tim > * Changes in Slurm 23.11.5 > ========================== > -- Fix Debian package build on systems that are not able to query the systemd > package. > -- data_parser/v0.0.40 - Emit a warning instead of an error if a disabled > parser is invoked. > -- slurmrestd - Improve handling when content plugins rely on parsers > that haven't been loaded. > -- Fix old pending jobs dying (Slurm version 21.08.x and older) when upgrading > Slurm due to "Invalid message version" errors. > -- Have client commands sleep for progressively longer periods when backed off > by the RPC rate limiting system. > -- slurmctld - Ensure agent queue is flushed correctly at shutdown time. > -- slurmdbd - correct lineage construction during assoc table conversion for > partition based associations. > -- Add new RPCs and API call for faster querying of job states from slurmctld. > -- slurmrestd - Add endpoint '/slurm/{data_parser}/jobs/state'. > -- squeue - Add `--only-job-state` argument to use faster query of job states. > -- Make a job requesting --no-requeue, or JobRequeue=0 in the slurm.conf, > supersede RequeueExit[Hold]. > -- Add sackd man page to the Debian package. > -- Fix issues with tasks when a job was shrinked more than once. > -- Fix reservation update validation that resulted in reject of correct > updates of reservation when the reservation was running jobs. > -- Fix possible segfault when the backup slurmctld is asserting control. > -- Fix regression introduced in 23.02.4 where slurmctld was not properly > tracking the total GRES selected for exclusive multi-node jobs, potentially > and incorrectly bypassing limits. > -- Fix tracking of jobs typeless GRES count when multiple typed GRES with the > same name are also present in the job allocation. Otherwise, the job could > bypass limits configured for the typeless GRES. > -- Fix tracking of jobs typeless GRES count when request specification has a > typeless GRES name first and then typed GRES of different names (i.e. > --gres=gpu:1,tmpfs:foo:2,tmpfs:bar:7). Otherwise, the job could bypass > limits configured for the generic of the typed one (tmpfs in the example). > -- Fix batch step not having SLURM_CLUSTER_NAME filled in. > -- slurmstepd - Avoid error during `--container` job cleanup about > RunTimeQuery never being configured. Results in cleanup where job steps not > fully started. > -- Fix nodes not being rebooted when using salloc/sbatch/srun "--reboot" flag. > -- Send scrun.lua in configless mode. > -- Fix rejecting an interactive job whose extra constraint request cannot > immediately be satisfied. > -- Fix regression in 23.11.0 when parsing LogTimeFormat=iso8601_ms that > prevented milliseconds from being printed. > -- Fix issue where you could have a gpu allocated as well as a shard on that > gpu allocated at the same time. > -- Fix slurmctld crashes when using extra constraints with job arrays. > -- sackd/slurmrestd/scrun - Avoid memory leak on new unix socket connection. > -- The failed node field is filled when a node fails but does not time out. > -- slurmrestd - Remove requiring job script field and job component script > fields to both be populated in the `POST /slurm/v0.0.40/job/submit` > endpoint as there can only be one batch step script for a job. > -- slurmrestd - When job script is provided in '.jobs[].script' and '.script' > fields, the '.script' field's value will be used in the > `POST /slurm/v0.0.40/job/submit` endpoint. > -- slurmrestd - Reject HetJob submission missing or empty batch script for > first Het component in the `POST /slurm/v0.0.40/job/submit` endpoint. > -- slurmrestd - Reject job when empty batch script submitted to the > POST /slurm/v0.0.40/job/submit` endpoint. > -- Fix pam_slurm and pam_slurm_adopt when using auth/slurm. > -- slurmrestd - Add 'cores_per_socket' field to > `POST /slurm/v0.0.40/job/submit` endpoint. > -- Fix srun and other Slurm commands running within a "configless" salloc when > salloc itself fetched the config. > -- Enforce binding with shared gres selection if requested. > -- Fix job allocation failures when the requested tres type or name ends in > "gres" or "license". > -- accounting_storage/mysql - Fix lineage string construction when adding a > user association with a partition. > -- Fix sattach command. > -- Fix ReconfigFlags. Due how reconfig was changed in 23.11, they will also > be used to influence the slurmctld startup as well. > -- Fix starting slurmd in configless mode if MUNGE support was disabled. -- Tim McMullan Release Management, Support, and Development SchedMD LLC - Commercial Slurm Development and Support [View Less]

1 0

Re: Lua script
by Gestió Servidors 21 Mar '24

21 Mar '24

Hello, I answer about my question: * What is the contents of your /etc/slurm/job_submit.lua file? function slurm_job_submit(job_desc, part_list, submit_uid) if (job_desc.user_id == 1008) then slurm.log_info("Trabajo sometido por druiz") if (job_desc['partition'] == "nodo.q") then if (job_desc['time_limit'] > 345600) # 345600 seconds == 4 days # nodo.q partition … [View More]

1 0

Re: Lua script
by Gestió Servidors 20 Mar '24

20 Mar '24

Hello, after adding "EnforcePartLimits=ALL" in slurm.conf and restarting slurmctld daemon, job continues being accepted... so I don't undertand where I'm doing some wrong. My slurm.conf is this: ControlMachine=my_server MailProg=/bin/mail MpiDefault=none ProctrackType=proctrack/linuxproc ReturnToService=2 SlurmctldPidFile=/var/run/slurmctld.pid SlurmctldPort=6817 SlurmdPidFile=/var/run/slurmd.pid SlurmdPort=6818 SlurmdSpoolDir=/var/spool/slurmd SlurmUser=slurm SlurmdUser=root AuthType=auth/… [View More]

2 1

Node CPU Allocation and Effective CPU Allocation for Running Jobs
by Muhammad Akhdhor 18 Mar '24

18 Mar '24

Hi Everyone, We have a SLURM cluster of three different types of nodes. One partition consists of nodes that have a large number of CPUs, 256 CPUs on each node. I'm trying to find out the current CPU allocation on some of those nodes but part of the information I gathered seems to be incorrect. If I use "*scontrol show node <node-name>*", I get this for the CPU info: RealMemory=450000 AllocMem=262144 FreeMem=235397 Sockets=2 Boards=1 State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 … [View More]

2 2

Slurm suspend preemption not working
by Verma, Nischey (HPC ENG,RAL,LSCI) 15 Mar '24

15 Mar '24

Hi All, we are trying to implement preemption in one of our partitions so we can run priority jobs on it and suspend the ones running on the partition and resume once the priority job is done. We have read through the Slurm documentation and did the configuration, but somehow we can not make it work. We tried other preemption like cancel which works fine, but Suspend isn't working. The jobs with higher priority which should suspend other jobs stays as pending and waiting for resources. We are … [View More]using a QOS to assign priority to the job and also configuring the QOS so it can preempt certain other QOSs. this is the output of our scontrol show config : AccountingStorageBackupHost = (null) AccountingStorageEnforce = associations,limits,qos AccountingStorageHost = -Some-server- AccountingStorageExternalHost = (null) AccountingStorageParameters = (null) AccountingStoragePort = 6819 AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages AccountingStorageType = accounting_storage/slurmdbd AccountingStorageUser = N/A AccountingStoreFlags = job_comment,job_env,job_extra,job_script AcctGatherEnergyType = (null) AcctGatherFilesystemType = (null) AcctGatherInterconnectType = (null) AcctGatherNodeFreq = 0 sec AcctGatherProfileType = (null) AllowSpecResourcesUsage = No AuthAltTypes = (null) AuthAltParameters = (null) AuthInfo = (null) AuthType = auth/munge BatchStartTimeout = 10 sec BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64 BcastParameters = (null) BOOT_TIME = 2024-03-15T13:32:05 BurstBufferType = (null) CliFilterPlugins = (null) ClusterName = cluster CommunicationParameters = (null) CompleteWait = 0 sec CoreSpecPlugin = (null) CpuFreqDef = Unknown CpuFreqGovernors = OnDemand,Performance,UserSpace CredType = cred/munge DebugFlags = (null) DefMemPerCPU = 3500 DependencyParameters = (null) DisableRootJobs = Yes EioTimeout = 60 EnforcePartLimits = ALL Epilog = (null) EpilogMsgTime = 2000 usec EpilogSlurmctld = (null) ExtSensorsType = (null) ExtSensorsFreq = 0 sec FairShareDampeningFactor = 1 FederationParameters = (null) FirstJobId = 1 GetEnvTimeout = 2 sec GresTypes = gpu GpuFreqDef = (null) GroupUpdateForce = 1 GroupUpdateTime = 600 sec HASH_VAL = Match HealthCheckInterval = 300 sec HealthCheckNodeState = ANY HealthCheckProgram = /usr/sbin/nhc InactiveLimit = 0 sec InteractiveStepOptions = --interactive --preserve-env --pty $SHELL JobAcctGatherFrequency = 30 JobAcctGatherType = jobacct_gather/linux JobAcctGatherParams = (null) JobCompHost = localhost JobCompLoc = (null) JobCompParams = (null) JobCompPort = 0 JobCompType = (null) JobCompUser = root JobContainerType = (null) JobDefaults = (null) JobFileAppend = 0 JobRequeue = 1 JobSubmitPlugins = lua KillOnBadExit = 0 KillWait = 30 sec LaunchParameters = (null) Licenses = (null) LogTimeFormat = iso8601_ms MailDomain = (null) MailProg = /bin/mail MaxArraySize = 1001 MaxBatchRequeue = 5 MaxDBDMsgs = 20024 MaxJobCount = 10000 MaxJobId = 67043328 MaxMemPerNode = UNLIMITED MaxNodeCount = 6 MaxStepCount = 40000 MaxTasksPerNode = 512 MCSPlugin = (null) MCSParameters = (null) MessageTimeout = 10 sec MinJobAge = 300 sec MpiDefault = pmix_v2 MpiParams = (null) NEXT_JOB_ID = 3626 NodeFeaturesPlugins = (null) OverTimeLimit = 0 min PluginDir = /usr/lib64/slurm PlugStackConfig = (null) PowerParameters = (null) PowerPlugin = (null) PreemptMode = GANG,SUSPEND PreemptParameters = (null) PreemptType = preempt/qos PreemptExemptTime = 00:00:00 PrEpParameters = (null) PrEpPlugins = prep/script PriorityParameters = (null) PrioritySiteFactorParameters = (null) PrioritySiteFactorPlugin = (null) PriorityDecayHalfLife = 7-00:00:00 PriorityCalcPeriod = 00:05:00 PriorityFavorSmall = No PriorityFlags = PriorityMaxAge = 7-00:00:00 PriorityUsageResetPeriod = NONE PriorityType = priority/multifactor PriorityWeightAge = 0 PriorityWeightAssoc = 0 PriorityWeightFairShare = 0 PriorityWeightJobSize = 0 PriorityWeightPartition = 0 PriorityWeightQOS = 5000 PriorityWeightTRES = (null) PrivateData = none ProctrackType = proctrack/cgroup Prolog = (null) PrologEpilogTimeout = 65534 PrologSlurmctld = (null) PrologFlags = (null) PropagatePrioProcess = 0 PropagateResourceLimits = ALL PropagateResourceLimitsExcept = (null) RebootProgram = /etc/slurm/slurmupdate.sh ReconfigFlags = (null) RequeueExit = (null) RequeueExitHold = (null) ResumeFailProgram = (null) ResumeProgram = (null) ResumeRate = 300 nodes/min ResumeTimeout = 60 sec ResvEpilog = (null) ResvOverRun = 0 min ResvProlog = (null) ReturnToService = 1 SchedulerParameters = bf_max_job_user=2 SchedulerTimeSlice = 30 sec SchedulerType = sched/backfill ScronParameters = (null) SelectType = select/cons_tres SelectTypeParameters = CR_CORE_MEMORY SlurmUser = slurm(888) SlurmctldAddr = (null) SlurmctldDebug = info SlurmctldHost[0] = -some-server- SlurmctldLogFile = /var/log/slurm/slurmctld.log SlurmctldPort = 6817 SlurmctldSyslogDebug = (null) SlurmctldPrimaryOffProg = (null) SlurmctldPrimaryOnProg = (null) SlurmctldTimeout = 120 sec SlurmctldParameters = (null) SlurmdDebug = info SlurmdLogFile = /var/log/slurm/slurmd.log SlurmdParameters = (null) SlurmdPidFile = /var/run/slurmd.pid SlurmdPort = 6818 SlurmdSpoolDir = /var/spool/slurm SlurmdSyslogDebug = (null) SlurmdTimeout = 300 sec SlurmdUser = root(0) SlurmSchedLogFile = (null) SlurmSchedLogLevel = 0 SlurmctldPidFile = /var/run/slurmctld.pid SLURM_CONF = /etc/slurm/slurm.conf SLURM_VERSION = 23.11.1 SrunEpilog = (null) SrunPortRange = 0-0 SrunProlog = (null) StateSaveLocation = /var/spool/slurmctld SuspendExcNodes = (null) SuspendExcParts = (null) SuspendExcStates = (null) SuspendProgram = (null) SuspendRate = 60 nodes/min SuspendTime = INFINITE SuspendTimeout = 30 sec SwitchParameters = (null) SwitchType = (null) TaskEpilog = (null) TaskPlugin = task/cgroup,task/affinity TaskPluginParam = none TaskProlog = (null) TCPTimeout = 2 sec TmpFS = /tmp TopologyParam = (null) TopologyPlugin = topology/default TrackWCKey = No TreeWidth = 16 UsePam = No UnkillableStepProgram = (null) UnkillableStepTimeout = 60 sec VSizeFactor = 0 percent WaitTime = 0 sec X11Parameters = (null) Cgroup Support Configuration: AllowedRAMSpace = 100.0% AllowedSwapSpace = 0.0% CgroupMountpoint = /sys/fs/cgroup CgroupPlugin = autodetect ConstrainCores = yes ConstrainDevices = yes ConstrainRAMSpace = yes ConstrainSwapSpace = no EnableControllers = no IgnoreSystemd = no IgnoreSystemdOnFailure = no MaxRAMPercent = 100.0% MaxSwapPercent = 100.0% MemorySwappiness = (null) MinRAMSpace = 30 MB MPI Plugins Configuration: PMIxCliTmpDirBase = (null) PMIxCollFence = (null) PMIxDebug = 0 PMIxDirectConn = yes PMIxDirectConnEarly = no PMIxDirectConnUCX = no PMIxDirectSameArch = no PMIxEnv = (null) PMIxFenceBarrier = no PMIxNetDevicesUCX = (null) PMIxTimeout = 300 PMIxTlsUCX = (null) the partition is configured like: PartitionName=test-partition AllowGroups=sysadmin,users AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED Nodes=vserv-[275-277] PriorityJobFactor=1 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=GANG,SUSPEND State=UP TotalCPUs=32 TotalNodes=4 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED TRES=cpu=32,mem=304933M,node=4,billing=32 Our QOS looks like this: | Name | Priority | GraceTime | Preempt | PreemptMode | Flags | UsageFactor | MaxTRESPU | MaxJobsPU | MaxSubmitPU |-----------|----------|-----------|-----------|-------------|-----------------|-------------|-----------------|-----------|----------- | normal | 50 | 00:00:00 | | cluster | | 1.000000 | | 20 | 50 | preempter | 1000 | 00:00:00 | preempted | gang,suspe+| | 1.000000 | | | | preempted | 0 | 00:00:00 | | gang,suspe+ | OverPartQOS | 1.000000 | cpu=100,node=10 | | I can provide more configs if needed. Do you guys see anything strange? Or any property to be set? This is the state of the queue: |JOBID | QOS | ST | TIME | NODELIST(REASON) | PARTITION | PRIORITY | | ---------------- ----------- --------- ----------- ---------------------- ----------------- ----------| | 3629 | preempted | PD | 0:00 | (Resources) | test-partition | 1 | | 3627 | preempted | R | 0:20 | vserv-276 | test-partition | 1 | | 3628 | preempted | R | 0:20 | vserv-277 | test-partition | 1 | | 3626 | preempted | R | 0:27 | vserv-275 | test-partition | 1 | | 3630 | preempter | PD | 0:00 | (Resources) | test-partition | 5000 | any advice is welcomed. Regards, Nischey Verma Nischey Verma [View Less]

3 2

2025

2024

slurm-users ----- 2025 ----- July 2025 June 2025 May 2025 April 2025 March 2025 February 2025 January 2025 ----- 2024 ----- December 2024 November 2024 October 2024 September 2024 August 2024 July 2024 June 2024 May 2024 April 2024 March 2024 February 2024 January 2024

slurm-users