We are pleased to announce the availability of Slurm version 23.11.5.
The 23.11.5 release includes some important fixes related to newer
features as well as some database fixes. The most noteworthy fixes
include fixing the sattach command (which only worked for root and
SlurmUser after 23.11.0) and fixing an issue with the construction of the new
lineage database entries. This last change also runs a query during the
upgrade from any prior 23.11 version to repair existing databases.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
-Tim
> * Changes in Slurm 23.11.5
> ==========================
> -- Fix Debian package build on systems that are not able to query the systemd
> package.
> -- data_parser/v0.0.40 - Emit a warning instead of an error if a disabled
> parser is invoked.
> -- slurmrestd - Improve handling when content plugins rely on parsers
> that haven't been loaded.
> -- Fix old pending jobs (submitted under Slurm 21.08.x and older) dying with
> "Invalid message version" errors when upgrading Slurm.
> -- Have client commands sleep for progressively longer periods when backed off
> by the RPC rate limiting system.
> -- slurmctld - Ensure agent queue is flushed correctly at shutdown time.
> -- slurmdbd - Correct lineage construction during assoc table conversion for
> partition-based associations.
> -- Add new RPCs and API call for faster querying of job states from slurmctld.
> -- slurmrestd - Add endpoint '/slurm/{data_parser}/jobs/state'.
> -- squeue - Add `--only-job-state` argument to use faster query of job states.
> -- Make a job requesting --no-requeue, or JobRequeue=0 in the slurm.conf,
> supersede RequeueExit[Hold].
> -- Add sackd man page to the Debian package.
> -- Fix issues with tasks when a job was shrunk more than once.
> -- Fix reservation update validation that rejected valid reservation updates
> while the reservation had running jobs.
> -- Fix possible segfault when the backup slurmctld is asserting control.
> -- Fix regression introduced in 23.02.4 where slurmctld was not properly
> tracking the total GRES selected for exclusive multi-node jobs, which could
> incorrectly bypass limits.
> -- Fix tracking of a job's typeless GRES count when multiple typed GRES with
> the same name are also present in the job allocation. Without this fix, the
> job could bypass limits configured for the typeless GRES.
> -- Fix tracking of a job's typeless GRES count when the request lists a
> typeless GRES name first followed by typed GRES with different names (e.g.
> --gres=gpu:1,tmpfs:foo:2,tmpfs:bar:7). Without this fix, the job could
> bypass limits configured for the generic form of the typed GRES (tmpfs in
> the example).
> -- Fix batch step not having SLURM_CLUSTER_NAME filled in.
> -- slurmstepd - Avoid an error about RunTimeQuery never being configured
> during `--container` job cleanup, which occurs when cleaning up job steps
> that were not fully started.
> -- Fix nodes not being rebooted when using salloc/sbatch/srun "--reboot" flag.
> -- Send scrun.lua in configless mode.
> -- Fix rejecting an interactive job whose extra constraint request cannot
> immediately be satisfied.
> -- Fix regression in 23.11.0 when parsing LogTimeFormat=iso8601_ms that
> prevented milliseconds from being printed.
> -- Fix issue where a GPU and a shard on that same GPU could be allocated at
> the same time.
> -- Fix slurmctld crashes when using extra constraints with job arrays.
> -- sackd/slurmrestd/scrun - Avoid memory leak on new unix socket connection.
> -- The failed node field is filled when a node fails but does not time out.
> -- slurmrestd - Remove requiring job script field and job component script
> fields to both be populated in the `POST /slurm/v0.0.40/job/submit`
> endpoint as there can only be one batch step script for a job.
> -- slurmrestd - When the job script is provided in both the '.jobs[].script'
> and '.script' fields, the '.script' field's value will be used in the
> `POST /slurm/v0.0.40/job/submit` endpoint.
> -- slurmrestd - Reject HetJob submissions with a missing or empty batch script
> for the first HetJob component in the `POST /slurm/v0.0.40/job/submit`
> endpoint.
> -- slurmrestd - Reject jobs submitted with an empty batch script to the
> `POST /slurm/v0.0.40/job/submit` endpoint.
> -- Fix pam_slurm and pam_slurm_adopt when using auth/slurm.
> -- slurmrestd - Add 'cores_per_socket' field to
> `POST /slurm/v0.0.40/job/submit` endpoint.
> -- Fix srun and other Slurm commands running within a "configless" salloc when
> salloc itself fetched the config.
> -- Enforce binding with shared gres selection if requested.
> -- Fix job allocation failures when the requested tres type or name ends in
> "gres" or "license".
> -- accounting_storage/mysql - Fix lineage string construction when adding a
> user association with a partition.
> -- Fix sattach command.
> -- Fix ReconfigFlags. Due to how reconfiguration was changed in 23.11, these
> flags now also influence slurmctld startup.
> -- Fix starting slurmd in configless mode if MUNGE support was disabled.
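As a usage sketch for the job-state query additions listed above (the new
squeue flag and the slurmrestd endpoint); the unix socket path below is an
assumption and any slurmrestd authentication setup (e.g. JWT) is omitted:

$ squeue --only-job-state
$ curl --unix-socket /run/slurmrestd.socket 'http://localhost/slurm/v0.0.40/jobs/state'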
--
Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
Hello,
Here are the answers to the questions about my issue:
* What is the contents of your /etc/slurm/job_submit.lua file?
function slurm_job_submit(job_desc, part_list, submit_uid)
   if (job_desc.user_id == 1008) then
      slurm.log_info("Job submitted by druiz")
      if (job_desc['partition'] == "nodo.q") then
         if (job_desc['time_limit'] > 345600) then
            -- 345600 seconds == 4 days
            -- nodo.q partition has "PartitionName=nodo.q Nodes=clus[01-12] Default=YES MaxTime=04:00:00" configuration
            return slurm.FAILURE
         end
      end
   end
   return slurm.SUCCESS
end

slurm.log_info("initialized")
return slurm.SUCCESS
* Did you reconfigure slurmctld?
Yes. At first I ran "scontrol reconfigure", but after checking that the limits weren't applied, I restarted the slurmctld daemon.
* Check the log file by: grep job_submit /var/log/slurm/slurmctld.log
In the slurmctld.log file on the Slurm server, "grep job_submit /var/log/slurm/slurmctld.log" doesn't return anything...
* What is your Slurm version?
23.11.0
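For reference, two quick checks worth running here (a sketch; the paths are
the usual defaults and may differ on this site). Note that the lua job_submit
plugin generally also expects a slurm_job_modify() function to be defined next
to slurm_job_submit(), and may refuse to load the script without it:

$ scontrol show config | grep -i JobSubmitPlugins    # should list "lua"
$ grep -i lua /var/log/slurm/slurmctld.log           # script load errors and log_info() output land here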
Thanks.
Hi Everyone,
We have a Slurm cluster with three different types of nodes. One
partition consists of nodes with a large number of CPUs, 256 on
each node.
I'm trying to find out the current CPU allocation on some of those nodes
but part of the information I gathered seems to be incorrect. If I use
"*scontrol
show node <node-name>*", I get this for the CPU info:
RealMemory=450000 AllocMem=262144 FreeMem=235397 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 …
[View More]Owner=N/A MCS_label=N/A
CPUAlloc=256 CPUEfctv=256 CPUTot=256 CPULoad=126.65
CfgTRES=cpu=256,mem=450000M,billing=256
AllocTRES=cpu=256,mem=256G
However, when I tried to identify those jobs to which the node's CPUs have
been allocated, and get a tally of the allocated CPUs, I can only see 128
CPUs that are effectively allocated on that node, based on the output
of squeue --state=R -o "%C %N". So I don't quite understand why the running jobs on
the nodes account for just 128, and not 256, CPU allocation even though
scontrol reports 100% CPU allocation on the node. Could this be due to some
misconfiguration, or a bug in the SLURM version we're running? We're
running Version=23.02.4. The interesting thing is that we have six nodes
that have similar specs, and all of them show up as allocated in the output
of sinfo, but the running jobs on each node account for just 128 CPU
allocation, as if they're all capped at 128.
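One way to cross-check this (a sketch; <node-name> and <jobid> are
placeholders) is to list only the running jobs on the node in question and
then inspect exactly which CPUs each job holds there:

$ squeue -w <node-name> --state=R -o "%i %u %C"
$ scontrol -d show job <jobid>    # detailed output lists Nodes=... CPU_IDs=... per node

Since the node reports ThreadsPerCore=2, it may also be worth checking whether
the 128 CPUs tallied from squeue are cores while CPUAlloc counts hardware
threads.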
Any thoughts, suggestions or assistance to figure this out would be greatly
appreciated.
Thanks,
Muhammad
Dear all,
Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems with
the communication between the slurmctld and slurmd processes.
We are running a cluster with 183 nodes and almost 19000 cores.
Unfortunately some nodes are in a different network preventing full
internode communication. A network topology and setting TopologyParam
RouteTree have been used to make sure no slurmd communication happens
between nodes on different networks.
In the new Slurm version we see the following issues, which did not appear
in 22.05:
1. slurmd processes acquire many network connections in CLOSE-WAIT (or
CLOSE_WAIT, depending on the tool used), causing the processes to hang when
trying to restart slurmd.
When checking for CLOSE-WAIT processes we see the following behaviour:
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
1 0 10.5.2.40:6818 10.5.0.43:58572
users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
1 0 10.5.2.40:6818 10.5.0.43:58284
users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
1 0 10.5.2.40:6818 10.5.0.43:58186
users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
1 0 10.5.2.40:6818 10.5.0.43:58592
users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
1 0 10.5.2.40:6818 10.5.0.43:58338
users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
1 0 10.5.2.40:6818 10.5.0.43:58568
users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
1 0 10.5.2.40:6818 10.5.0.43:58472
users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
1 0 10.5.2.40:6818 10.5.0.43:58486
users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
1 0 10.5.2.40:6818 10.5.0.43:58316
users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))
The first IP address is that of the compute node, the second that of the
node running slurmctld. The nodes can communicate using these IP addresses
just fine.
2. slurmd cannot be properly restarted
[2024-01-18T10:45:26.589] slurmd version 23.11.1 started
[2024-01-18T10:45:26.593] error: Error binding slurm stream socket: Address
already in use
[2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818): Address
already in use
This is probably because of the processes being stuck in CLOSE-WAIT, which can
only be killed using signal -9 (see the sketch after this list of issues).
3. We see jobs stuck in completing CG state, probably due to communication
issues between slurmctld and slurmd. The slurmctld sends repeated kill
requests but those do not seem to be acknowledged by the client. This
happens more often in large job arrays, or generally when many jobs start
at the same time. However, this could be just a biased observation (i.e.,
it is more noticeable on large job arrays because there are more jobs to
fail in the first place).
4. Since the new version we also see messages like:
[2024-01-17T09:58:48.589] error: Failed to kill program loading user
environment
[2024-01-17T09:58:48.590] error: Failed to load current user environment
variables
[2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's local
environment, running only with passed environment
The effect of this is that the users run with the wrong environment and
can’t load the modules for the software that is needed by their jobs. This
leads to many job failures.
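For issues 1 and 2 above, a diagnostic sketch (6818 is the default SlurmdPort;
adjust if this site uses another port) to list the stuck sockets and find the
process still holding the slurmd listen port:

$ ss -tnp state close-wait '( sport = :6818 )'
$ ss -tlnp '( sport = :6818 )'    # shows the PID still bound to port 6818
$ kill -9 <pid>                   # last resort, as described above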
The issue appears to be somewhat similar to the one described at:
https://bugs.schedmd.com/show_bug.cgi?id=18561
In that case the site downgraded the slurmd clients to 22.05 which got rid
of the problems.
We’ve now downgraded the slurmd on the compute nodes to 23.02.7 which also
seems to be a workaround for the issue.
Does anyone know of a better solution?
Kind regards,
Fokke Dijkstra
--
Fokke Dijkstra <f.dijkstra(a)rug.nl> <f.dijkstra(a)rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA Groningen, The Netherlands
Hello,
We have a use case in which we need to launch multiple concurrently running MPI applications inside a job allocation. Most supercomputing facilities limit the number of concurrent job steps as they incur an overhead with the global Slurm scheduler. Some frameworks, such as the Flux framework from LLNL, claim to mitigate this issue by starting an instance of their own scheduler inside an allocation, which then acts as the resource manager for the compute nodes in the allocation.
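For context, the usual Slurm-only pattern for this is to background several
srun job steps inside a single allocation; a minimal sketch (application
names, node and task counts are placeholders):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
# Each backgrounded srun becomes its own job step; every step launch is
# registered with the central slurmctld, which is the overhead mentioned above.
srun --exact -n 4 ./app_a &
srun --exact -n 4 ./app_b &
wait

Depending on the site, per-step memory or CPU options may need to be set
explicitly for the steps to actually run concurrently.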
Out of curiosity, I was wondering if there is a fundamental reason behind having a single global scheduler that the srun launch commands must contact to launch job steps. Was it perhaps considered overkill to develop a ‘hierarchical’ design in which Slurm launches a local job daemon for every allocation that manages resources for that allocation? I would appreciate your insight in understanding more about Slurm’s core design.
Thanks and regards,
Kshitij Mehta
Oak Ridge National Laboratory
Our cluster has developed a strange intermittent behaviour where jobs are being put into a pending state because they aren't passing the AssocGrpCpuLimit, even though the user submitting has enough cpus for the job to run.
For example:
$ squeue -o "%.6i %.9P %.8j %.8u %.2t %.10M %.7m %.7c %.20R"
JOBID PARTITION NAME USER ST TIME MIN_MEM MIN_CPU NODELIST(REASON)
799 normal hostname andrewss PD 0:00 2G 5 (AssocGrpCpuLimit)
...so the job isn't running, and it's the only job in the queue, but:
$ sacctmgr list associations part=normal user=andrewss format=Account,User,Partition,Share,GrpTRES
Account User Partition Share GrpTRES
---------- ---------- ---------- --------- -------------
andrewss andrewss normal 1 cpu=5
That user has a limit of 5 CPUs so the job should run.
The weird thing is that this effect is intermittent. A job can hang and the queue will stall for ages but will then suddenly start working and you can submit several jobs and they all work, until one fails again.
The cluster has active nodes and plenty of resource:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 2 idle compute-0-[6-7]
interactive up 1-12:00:00 3 idle compute-1-[0-1,3]
The slurmctld log just says:
[2024-03-14T16:21:41.275] _slurm_rpc_submit_batch_job: JobId=799 InitPrio=4294901720 usec=259
Whilst it's in this state I can run other jobs with core requests of up to 4 and they work, but not 5. It's like slurm is adding one CPU to the request and then denying it.
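One thing worth comparing here (a sketch; the user name is taken from the
output above, and option spellings can vary slightly between versions) is what
slurmctld's in-memory association manager thinks the limit and current usage
are, versus what sacctmgr reports from the database:

$ scontrol show assoc_mgr users=andrewss flags=assoc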
I'm sure I'm missing something fundamental but would appreciate it if someone could point out what it is!
Thanks
Simon.
Our website has gone through some much needed change and we'd love for
you to explore it!
The new SchedMD.com is equipped with the latest information about
Slurm, your favorite workload manager, and details about SchedMD
services, support, and training offerings.
Toggle through our Industries pages
(https://www.schedmd.com/slurm-industries/) to learn more about how
Slurm can service your specific site needs. Why Slurm?
(https://www.schedmd.com/slurm/why-slurm/) gives you all the basics
around our market-leading scheduler and SchedMD Services
(https://www.schedmd.com/slurm-support/our-services/) addresses all
the ways we can help you optimize your site.
These new web pages also feature access to our Documentation Site, Bug
Site, and Installation Guide. Browse our Events tab to see where we'll
be when, and be sure to register for our Slurm User Group (SLUG) in
Oslo, Norway this fall!
(https://www.schedmd.com/about-schedmd/events/)
SchedMD.com, your one stop shop for all things Slurm. Check it out now!
--
Victoria Hobson
SchedMD LLC
Vice President of Marketing