Hello,
We upgraded our cluster to Slurm 23.11.1 and then, a few weeks later, to 23.11.4. Since then, Slurm doesn't detect hyperthreaded CPUs. We downgraded our test cluster and the issue is not present with Slurm 22.05 (we had skipped Slurm 23.02).
For example, we are working with this node:
$ slurmd -C
NodeName=node03 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128215
It is defined like this in slurm.conf:
SelectTypeParameters=CR_CPU_Memory
…
TaskPlugin=task/cgroup,task/affinity
NodeName=node03 CPUs=40 RealMemory=150000 Feature=htc MemSpecLimit=5000
NodeSet=htc Feature=htc
PartitionName=htc Default=YES MinNodes=0 MaxNodes=1 Nodes=htc DefMemPerCPU=1000 State=UP LLN=Yes MaxMemPerNode=142000
So no oversubscribing: 20 cores and 40 CPUs thanks to hyperthreading. Until the upgrade, Slurm allocated all 40 CPUs: when launching 40 single-CPU jobs, each of those jobs would use its own CPU, which is the expected behavior.
Since the upgrade, we can still launch those 40 jobs, but only the first half of the CPUs is used (CPUs 0 to 19 according to htop). Each of those CPUs is shared by 2 jobs, while the second half of the CPUs (#20 to 39) stays completely idle. When launching 40 stress processes directly on the node, without going through Slurm, all the CPUs are used.
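For reference, this is roughly how we launch the 40 single-CPU jobs (a simplified sketch; stress.py is just our own CPU-burning test script):
for i in $(seq 1 40); do
    sbatch --ntasks=1 --cpus-per-task=1 --wrap "./stress.py"
done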
When binding to a specific CPU with srun, it works up to CPU #19 and then an error occurs, even though the allocation includes all the CPUs of the node:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
# Works for 0 to 19
srun --cpu-bind=v,map_cpu:19 stress.py
# Doesn't work (20 to 39)
srun --cpu-bind=v,map_cpu:20 stress.py
# Output:
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000FFFFF.
srun: error: Task launch for StepId=57194.0 failed on node node03: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
This behaviour affects all our nodes, some of which have been restarted recently and others not. It causes the jobs to be frequently interrupted, widening the gap between the real (wall-clock) time and the user+system time and making the jobs slower. We have been poring over the documentation but, from what we understand, our configuration seems correct. In particular, as advised by the documentation [1], we do not set ThreadsPerCore in slurm.conf.
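In case it helps, this is the explicit topology line we could try instead of the bare CPUs=40 definition (just a sketch for now, with the counts copied from the slurmd -C output above; we have not tested it yet):
NodeName=node03 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=150000 Feature=htc MemSpecLimit=5000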
Are we missing something, or is there a regression or a configuration change in Slurm since version 23.11?
Thank you,
Guillaume
[1] : https://slurm.schedmd.com/slurm.conf.html#OPT_ThreadsPerCore
Hi,
I’m trying to set up multifactor priority on our cluster and am having some trouble getting it to behave the way I’d like. My main issues seem to revolve around FairShare.
We have multiple projects on our cluster and multiple users in those projects (and some users are in multiple projects, of course). I would like the FairShare to be based only on the project associated with the job; if user A and user B both submit jobs on project C, the FairShare should be identical. However, it looks like the FairShare is based on both the project and the user. Is there a way to get the behavior I'm looking for?
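One thing I have been wondering about is whether setting the user-level shares to "parent", so that only the account's shares count, would do this, roughly along these lines (untested sketch; the user and account names are placeholders matching the example above):
sacctmgr modify user where name=userA account=projectC set fairshare=parent
sacctmgr modify user where name=userB account=projectC set fairshare=parent
but I'm not sure whether that is the intended approach.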
Thanks for any help you can provide.
Slurm major releases are moving to a six month release cycle. This
change starts with the upcoming Slurm 24.05 release this May. Slurm
24.11 will follow in November 2024. Major releases then continue every
May and November in 2025 and beyond.
There are two main goals of this change:
- Faster delivery of newer features and functionality for customers.
- "Predictable" release timing, especially for those sites that would
prefer to upgrade during an annual system maintenance window.
SchedMD will be adjusting our handling of backwards-compatibility within
Slurm itself, and how SchedMD's support services will handle older releases.
For the 24.05 release, Slurm will still only support upgrading from (and
mixed-version operations with) the prior two releases (23.11, 23.02).
Starting with 24.11, Slurm will start supporting upgrades from the prior
three releases (24.05, 23.11, 23.02).
SchedMD's Slurm Support has been built around an 18-month cycle. This
18-month cycle has traditionally covered the current stable release,
plus one prior major release. With the increase in release frequency,
this support window will now cover the current stable release, plus
two prior major releases.
The blog post version of this announcement includes a table that
outlines the updated support lifecycle:
https://www.schedmd.com/slurm-releases-move-to-a-six-month-cycle/
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
Dear all,
I am having trouble finalizing the configuration of the backup
controller for my slurm cluster.
In principle, if no job is running everything seems fine: the slurmctld
services on both the primary and the backup controller are running, and
if I stop the service on the primary controller, after roughly 10 s
(SlurmctldTimeout = 10 sec) the backup controller takes over.
Also, if I run the sinfo or squeue command during those 10 s of
inactivity, the shell stays pending but recovers perfectly once the
backup controller has taken control, and it works the same way when the
primary controller is back.
Unfortunately, if I try to do the same test while a job is running there
are two different
behaviors depending on the initial scenario.
1st scenario:
Both the primary and the backup controller are fine. I launch a batch
script and verify with sinfo and squeue that it is running. While the
script is still running I successfully stop the service on the primary
controller, but at this point everything goes wrong: in the slurmctld
service log on the backup controller I find the following errors:
slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
slurmctld: error: slurm_accept_msg_conn poll: Bad address
slurmctld: error: slurm_accept_msg_conn poll: Bad address
and the sinfo and squeue commands report "Unable to contact slurm
controller (connect failure)".
2nd scenario:
the primary controller is stopped and I launch a batch job while the
backup controller
is the only one working. While the job is running, I restart the
slurmctld service on the primary
controller. In this case the primary controller takes over immediately:
everything is smooth
and safe and the sinfo and squeue commands continue to work perfectly.
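For context, this is the general shape of the failover setup I am following (only a sketch here, with placeholder hostnames and paths rather than my real slurm.conf):
SlurmctldHost=ctl-primary
SlurmctldHost=ctl-backup
StateSaveLocation=/shared/slurm/statesave
SlurmctldTimeout=10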
What might be the problem?
Many thanks in advance!
Miriam
Hi All,
We are currently trying to set up cgroup_exporter
<https://github.com/treydock/cgroup_exporter> for slurm. It's been working
smoothly with cgroups.v1 and slurm-22.05.7. However, we're facing some
challenges with RHEL-9, slurm-23.11.1 and cgroups.v2.
The cgroup_exporter isn't capturing the slurm cgroup job information. I'm
reaching out to see if any other sites have managed to make this work. If
you're using a different exporter that's working for your site, could you
please let us know? Thanks!
Regards,
Singh
We are pleased to announce the availability of Slurm version 23.11.5.
The 23.11.5 release includes some important fixes related to newer
features as well as some database fixes. The most noteworthy fixes
include fixing the sattach command (which only worked for root and
SlurmUser after 23.11.0) and fixing an issue while constructing the new
lineage database entries. This last change will also perform a query
during the upgrade from any prior 23.11 version to fix existing databases.
Slurm can be downloaded from https://www.schedmd.com/downloads.php.
-Tim
> * Changes in Slurm 23.11.5
> ==========================
> -- Fix Debian package build on systems that are not able to query the systemd
> package.
> -- data_parser/v0.0.40 - Emit a warning instead of an error if a disabled
> parser is invoked.
> -- slurmrestd - Improve handling when content plugins rely on parsers
> that haven't been loaded.
> -- Fix old pending jobs dying (Slurm version 21.08.x and older) when upgrading
> Slurm due to "Invalid message version" errors.
> -- Have client commands sleep for progressively longer periods when backed off
> by the RPC rate limiting system.
> -- slurmctld - Ensure agent queue is flushed correctly at shutdown time.
> -- slurmdbd - correct lineage construction during assoc table conversion for
> partition based associations.
> -- Add new RPCs and API call for faster querying of job states from slurmctld.
> -- slurmrestd - Add endpoint '/slurm/{data_parser}/jobs/state'.
> -- squeue - Add `--only-job-state` argument to use faster query of job states.
> -- Make a job requesting --no-requeue, or JobRequeue=0 in the slurm.conf,
> supersede RequeueExit[Hold].
> -- Add sackd man page to the Debian package.
> -- Fix issues with tasks when a job was shrunk more than once.
> -- Fix reservation update validation that resulted in rejection of correct
> updates to a reservation while the reservation had running jobs.
> -- Fix possible segfault when the backup slurmctld is asserting control.
> -- Fix regression introduced in 23.02.4 where slurmctld was not properly
> tracking the total GRES selected for exclusive multi-node jobs, potentially
> and incorrectly bypassing limits.
> -- Fix tracking of jobs typeless GRES count when multiple typed GRES with the
> same name are also present in the job allocation. Otherwise, the job could
> bypass limits configured for the typeless GRES.
> -- Fix tracking of jobs typeless GRES count when request specification has a
> typeless GRES name first and then typed GRES of different names (i.e.
> --gres=gpu:1,tmpfs:foo:2,tmpfs:bar:7). Otherwise, the job could bypass
> limits configured for the generic of the typed one (tmpfs in the example).
> -- Fix batch step not having SLURM_CLUSTER_NAME filled in.
> -- slurmstepd - Avoid error during `--container` job cleanup about
> RunTimeQuery never being configured. This allows cleanup when job steps were
> not fully started.
> -- Fix nodes not being rebooted when using salloc/sbatch/srun "--reboot" flag.
> -- Send scrun.lua in configless mode.
> -- Fix rejecting an interactive job whose extra constraint request cannot
> immediately be satisfied.
> -- Fix regression in 23.11.0 when parsing LogTimeFormat=iso8601_ms that
> prevented milliseconds from being printed.
> -- Fix issue where you could have a gpu allocated as well as a shard on that
> gpu allocated at the same time.
> -- Fix slurmctld crashes when using extra constraints with job arrays.
> -- sackd/slurmrestd/scrun - Avoid memory leak on new unix socket connection.
> -- The failed node field is filled when a node fails but does not time out.
> -- slurmrestd - Remove requiring job script field and job component script
> fields to both be populated in the `POST /slurm/v0.0.40/job/submit`
> endpoint as there can only be one batch step script for a job.
> -- slurmrestd - When job script is provided in '.jobs[].script' and '.script'
> fields, the '.script' field's value will be used in the
> `POST /slurm/v0.0.40/job/submit` endpoint.
> -- slurmrestd - Reject HetJob submission missing or empty batch script for
> first Het component in the `POST /slurm/v0.0.40/job/submit` endpoint.
> -- slurmrestd - Reject job when empty batch script submitted to the
> POST /slurm/v0.0.40/job/submit` endpoint.
> -- Fix pam_slurm and pam_slurm_adopt when using auth/slurm.
> -- slurmrestd - Add 'cores_per_socket' field to
> `POST /slurm/v0.0.40/job/submit` endpoint.
> -- Fix srun and other Slurm commands running within a "configless" salloc when
> salloc itself fetched the config.
> -- Enforce binding with shared gres selection if requested.
> -- Fix job allocation failures when the requested tres type or name ends in
> "gres" or "license".
> -- accounting_storage/mysql - Fix lineage string construction when adding a
> user association with a partition.
> -- Fix sattach command.
> -- Fix ReconfigFlags. Due to how reconfig was changed in 23.11, they now also
> influence slurmctld startup.
> -- Fix starting slurmd in configless mode if MUNGE support was disabled.
--
Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
Hello,
To answer the questions about my issue:
* What is the contents of your /etc/slurm/job_submit.lua file?
function slurm_job_submit(job_desc, part_list, submit_uid)
    if (job_desc.user_id == 1008) then
        slurm.log_info("Job submitted by druiz")
        if (job_desc['partition'] == "nodo.q") then
            -- 345600 seconds == 4 days
            -- the nodo.q partition has "PartitionName=nodo.q Nodes=clus[01-12] Default=YES MaxTime=04:00:00" configured
            if (job_desc['time_limit'] > 345600) then
                return slurm.FAILURE
            end
        end
    end
    return slurm.SUCCESS
end

slurm.log_info("initialized")
return slurm.SUCCESS
* Did you reconfigure slurmctld?
Yes. At first I ran "scontrol reconfigure", but after checking that the limits weren't applied, I restarted the slurmctld daemon.
* Check the log file by: grep job_submit /var/log/slurm/slurmctld.log
In the slurmctld.log file on the Slurm server, "grep job_submit /var/log/slurm/slurmctld.log" doesn't return anything...
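(In case it is relevant, I also check that the plugin is configured at all with something like "scontrol show config | grep -i JobSubmitPlugins", which should report "lua" if the plugin is enabled.)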
* What is your Slurm version?
23.11.0
Thanks.
Hi Everyone,
We have a SLURM cluster with three different types of nodes. One
partition consists of nodes that have a large number of CPUs: 256 on
each node.
I'm trying to find out the current CPU allocation on some of those nodes
but part of the information I gathered seems to be incorrect. If I use
"*scontrol
show node <node-name>*", I get this for the CPU info:
RealMemory=450000 AllocMem=262144 FreeMem=235397 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
CPUAlloc=256 CPUEfctv=256 CPUTot=256 CPULoad=126.65
CfgTRES=cpu=256,mem=450000M,billing=256
AllocTRES=cpu=256,mem=256G
However, when I tried to identify those jobs to which the node's CPUs have
been allocated, and get a tally of the allocated CPUs, I can only see 128
CPUs that are effectively allocated on that node, based on the output
of *squeue --state=R -o "%C %N"*. So I don't quite understand why the running jobs on
the nodes account for just 128, and not 256, CPU allocation even though
scontrol reports 100% CPU allocation on the node. Could this be due to some
misconfiguration, or a bug in the SLURM version we're running? We're
running Version=23.02.4. The interesting thing is that we have six nodes
that have similar specs, and all of them show up as allocated in the output
of *sinfo*, but the running jobs on each node account for just 128 CPU
allocation, as if they're all capped at 128.
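For reference, this is roughly how I am tallying the allocated CPUs on a given node (a sketch; <node-name> is a placeholder):
squeue --state=R --nodelist=<node-name> -o "%C" -h | awk '{sum+=$1} END {print sum}'
(with the caveat that %C counts a multi-node job's total CPUs, so it is only a rough tally.)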
Any thoughts, suggestions or assistance to figure this out would be greatly
appreciated.
Thanks,
Muhammad