Hi Everyone,
We have a SLURM cluster with three different types of nodes. One partition consists of nodes that have a large number of CPUs: 256 CPUs on each node.
I'm trying to find out the current CPU allocation on some of those nodes
but part of the information I gathered seems to be incorrect. If I use "scontrol show node <node-name>", I get this for the CPU info:
RealMemory=450000 AllocMem=262144 FreeMem=235397 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
CPUAlloc=256 CPUEfctv=256 CPUTot=256 CPULoad=126.65
CfgTRES=cpu=256,mem=450000M,billing=256
AllocTRES=cpu=256,mem=256G
However, when I tried to identify those jobs to which the node's CPUs have been allocated, and get a tally of the allocated CPUs, I can only see 128 CPUs that are effectively allocated on that node, based on the output of squeue --state=R -o "%C %N". So I don't quite understand why the running jobs on
the nodes account for just 128, and not 256, CPU allocation even though
scontrol reports 100% CPU allocation on the node. Could this be due to some
misconfiguration, or a bug in the SLURM version we're running? We're
running Version=23.02.4. The interesting thing is that we have six nodes
that have similar specs, and all of them show up as allocated in the output
of sinfo, but the running jobs on each node account for just 128 CPU
allocation, as if they're all capped at 128.
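For what it's worth, the per-node tally was done roughly along these lines (a sketch only; note that squeue's %C prints each job's total CPU count across all of its nodes, so multi-node jobs could skew a per-node sum):
squeue --state=R --nodelist=<node-name> --noheader -o "%C %N" | awk '{sum += $1} END {print sum}'
# Cross-check with the controller's per-node CPU_IDs for each job on the node:
squeue --state=R --nodelist=<node-name> --noheader -o "%i" | xargs -n1 scontrol -d show job | grep -E 'JobId=|CPU_IDs='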
Any thoughts, suggestions or assistance to figure this out would be greatly
appreciated.
Thanks,
Muhammad
Dear all,
Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems with
the communication between the slurmctld and slurmd processes.
We are running a cluster with 183 nodes and almost 19000 cores.
Unfortunately, some nodes are in a different network, preventing full internode communication. A network topology and the setting TopologyParam=RouteTree have been used to make sure no slurmd communication happens between nodes on different networks.
In the new Slurm version we see the following issues, which did not appear in 22.05:
1. slurmd processes acquire many network connections in CLOSE-WAIT (or CLOSE_WAIT, depending on the tool used), causing the processes to hang when we try to restart slurmd.
When checking for CLOSE-WAIT processes we see the following behaviour:
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
1 0 10.5.2.40:6818 10.5.0.43:58572
users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
1 0 10.5.2.40:6818 10.5.0.43:58284
users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
1 0 10.5.2.40:6818 10.5.0.43:58186
users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
1 0 10.5.2.40:6818 10.5.0.43:58592
users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
1 0 10.5.2.40:6818 10.5.0.43:58338
users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
1 0 10.5.2.40:6818 10.5.0.43:58568
users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
1 0 10.5.2.40:6818 10.5.0.43:58472
users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
1 0 10.5.2.40:6818 10.5.0.43:58486
users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
1 0 10.5.2.40:6818 10.5.0.43:58316
users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))
The first IP address is that of the compute node, the second that of the
node running slurmctld. The nodes can communicate using these IP addresses
just fine.
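For reference, the listing above is the kind of output a check along these lines produces (the exact invocation here is an assumption; ss is from iproute2):
ss -tnp state close-wait '( sport = :6818 )'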
2. slurmd cannot be properly restarted
[2024-01-18T10:45:26.589] slurmd version 23.11.1 started
[2024-01-18T10:45:26.593] error: Error binding slurm stream socket: Address
already in use
[2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818): Address
already in use
This is probably because of the processes stuck in CLOSE-WAIT, which can only be killed with SIGKILL (kill -9).
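A sketch of the restart workaround implied above (assuming slurmd is managed by systemd):
ss -tlnp '( sport = :6818 )'   # confirm which slurmd PID still holds the listen port
pkill -KILL -x slurmd          # only SIGKILL frees the port, as noted above
systemctl start slurmd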
3. We see jobs stuck in the completing (CG) state, probably due to communication
issues between slurmctld and slurmd. The slurmctld sends repeated kill
requests but those do not seem to be acknowledged by the client. This
happens more often in large job arrays, or generally when many jobs start
at the same time. However, this could be just a biased observation (i.e.,
it is more noticeable on large job arrays because there are more jobs to
fail in the first place).
4. Since the new version we also see messages like:
[2024-01-17T09:58:48.589] error: Failed to kill program loading user
environment
[2024-01-17T09:58:48.590] error: Failed to load current user environment
variables
[2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's local
environment, running only with passed environment
The effect of this is that the users run with the wrong environment and
can’t load the modules for the software that is needed by their jobs. This
leads to many job failures.
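If it helps for comparison: as far as we understand, these messages appear when slurmd tries to capture the user's login environment for the batch job, so a rough manual check for slow or broken login shells (only a guess at the mechanism, not a confirmed diagnosis) is something like:
time su - <username> -c env > /dev/null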
The issue appears to be somewhat similar to the one described at:
https://bugs.schedmd.com/show_bug.cgi?id=18561
In that case the site downgraded the slurmd clients to 22.05 which got rid
of the problems.
We’ve now downgraded the slurmd on the compute nodes to 23.02.7 which also
seems to be a workaround for the issue.
Does anyone know of a better solution?
Kind regards,
Fokke Dijkstra
--
Fokke Dijkstra <f.dijkstra(a)rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA Groningen, The Netherlands
Hello,
We have a use case in which we need to launch multiple concurrently running MPI applications inside a job allocation. Most supercomputing facilities limit the number of concurrent job steps as they incur an overhead with the global Slurm scheduler. Some frameworks, such as the Flux framework from LLNL, claim to mitigate this issue by starting an instance of their own scheduler inside an allocation, which then acts as the resource manager for the compute nodes in the allocation.
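For concreteness, the pattern in question is essentially many concurrent job steps inside a single allocation, along these lines (a sketch; app_a and app_b are placeholder binaries):
#!/bin/bash
#SBATCH --nodes=4
# Each srun below creates a separate job step, and every step launch is
# negotiated with the central slurmctld, which is where the overhead comes from.
srun --nodes=2 --ntasks=8 --exact ./app_a &
srun --nodes=2 --ntasks=8 --exact ./app_b &
wait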
Out of curiosity, I was wondering if there is a fundamental reason behind having a single global scheduler that the srun launch commands must contact to launch job steps. Was it perhaps considered overkill to develop a ‘hierarchical’ design in which Slurm launches a local job daemon for every allocation that manages resources for that allocation? I would appreciate your insight in understanding more about Slurm’s core design.
Thanks and regards,
Kshitij Mehta
Oak Ridge National Laboratory
Our cluster has developed a strange intermittent behaviour where jobs are being put into a pending state because they fail the AssocGrpCpuLimit check, even though the submitting user has enough CPUs for the job to run.
For example:
$ squeue -o "%.6i %.9P %.8j %.8u %.2t %.10M %.7m %.7c %.20R"
JOBID PARTITION NAME USER ST TIME MIN_MEM MIN_CPU NODELIST(REASON)
799 normal hostname andrewss PD 0:00 2G 5 (AssocGrpCpuLimit)
...so the job isn't running, and it's the only job in the queue, but:
$ sacctmgr list associations part=normal user=andrewss format=Account,User,Partition,Share,GrpTRES
Account User Partition Share GrpTRES
---------- ---------- ---------- --------- -------------
andrewss andrewss normal 1 cpu=5
That user has a limit of 5 CPUs so the job should run.
The weird thing is that this effect is intermittent. A job can hang and the queue will stall for ages, but then it will suddenly start working: you can submit several jobs and they all work, until one fails again.
The cluster has active nodes and plenty of resource:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 2 idle compute-0-[6-7]
interactive up 1-12:00:00 3 idle compute-1-[0-1,3]
The slurmctld log just says:
[2024-03-14T16:21:41.275] _slurm_rpc_submit_batch_job: JobId=799 InitPrio=4294901720 usec=259
Whilst it's in this state I can run other jobs with core requests of up to 4 and they work, but not 5. It's like slurm is adding one CPU to the request and then denying it.
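One extra check that might help while it's in this state is dumping the controller's in-memory association limits and usage (a sketch; scontrol show assoc_mgr reads slurmctld's cache rather than the slurmdbd database):
scontrol show assoc_mgr users=andrewss flags=assoc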
I'm sure I'm missing something fundamental but would appreciate it if someone could point out what it is!
Thanks
Simon.
Our website has gone through some much-needed change and we'd love for you to explore it!
The new SchedMD.com is equipped with the latest information about
Slurm, your favorite workload manager, and details about SchedMD
services, support, and training offerings.
Toggle through our Industries pages
(https://www.schedmd.com/slurm-industries/) to learn more about how
Slurm can service your specific site needs. Why Slurm?
(https://www.schedmd.com/slurm/why-slurm/) gives you all the basics
around our market-leading scheduler and SchedMD Services
(https://www.schedmd.com/slurm-support/our-services/) addresses all
the ways we can help you optimize your site.
These new web pages also feature access to our Documentation Site, Bug
Site, and Installation Guide. Browse our Events tab to see where we'll
be when, and be sure to register for our Slurm User Group (SLUG) in
Oslo, Norway this fall!
(https://www.schedmd.com/about-schedmd/events/)
SchedMD.com, your one stop shop for all things Slurm. Check it out now!
--
Victoria Hobson
SchedMD LLC
Vice President of Marketing
I'm a little late to this party but would love to establish contact with others using slurm in Kubernetes.
I recently joined a research institute in Vienna (IIASA) and I'm getting to grips with slurm and Kubernetes (my previous role was data engineering / fintech). My current setup sounds like what Urban described in this thread, back in Nov 22. It has some rough edges though.
Right now, I'm trying to upgrade to slurm-23.11.4 in Ubuntu 23.10 containers. I'm having trouble with the cgroup/v2 plugin.
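For context, the cgroup configuration being tested is roughly this minimal cgroup.conf (values are illustrative, not a recommendation):
CgroupPlugin=cgroup/v2
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes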
Are you still using slurm on K8s Urban? How did your installation work out Hans?
Would either of you be willing to share your experiences?
Regards,
Alan.
Hi all,
We're trying to enable sharding on our compute cluster.
On this cluster:
- ensicompute-1 comes with 1 NVIDIA V100 GPU;
- ensicompute-13 comes with 3 NVIDIA A40 GPUs;
- all other nodes (for now, ensicompute-11 and ensicompute-12, but several others will come) come with 3 NVIDIA RTX 6000 GPUs.
To enable sharding, I followed these steps:
1. [slurm.conf] Add "shard" to GresTypes;
2. [slurm.conf] Add "shard:N" to Gres for each node. For testing purposes, I have set N to 9, so each GPU can execute up to 3 jobs concurrently:
NodeName=ensicompute-[11-12] Gres=gpu:Quadro:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht
3. [gres.conf] Declare the shards after the definition of the GPU GRES.
For step 3, I tried different things, leading to different outcomes:
a. Define a global number of shards, for the entire host:
Name=shard Count=9
==> This way, sharding seems to work OK, but all the jobs are executed on GPU#0. If running 12 jobs, for example, 9 of them are assigned to GPU#0 and start executing, while 3 of them remain in a pending state. No job is assigned to GPU#1 or GPU#2 (see the test sketch after this list).
b. Define a per-GPU number of shards, associated with the device file representing the GPU:
Name=shard Count=3 File=/dev/nvidia0
Name=shard Count=3 File=/dev/nvidia1
Name=shard Count=3 File=/dev/nvidia2
==> In this case, the slurmd service fails to start on the compute node. The error message found in /var/log/slurmd.log is "fatal: Invalid GRES record for shard, count does not match File value".
c. Don't define anything about shards in gres.conf.
==> Same behavior as in a.: all jobs are executed on GPU#0.
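The test sketch referred to above, for reference (illustrative only; the --wrap payload just reports which GPU each job ends up seeing):
for i in $(seq 1 12); do
  sbatch --job-name="shard-test-$i" --gres=shard:1 \
    --wrap='echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L; sleep 120'
done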
I attach to this message the full content of the slurm.conf and gres.conf files.
What is the proper way to configure sharding in a cluster with several GPUs per node?
Is there a way to specify how many shards should be allocated to each GPU?
Cheers,
François
=== slurm.conf ===
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=ensimag
SlurmctldHost=nash
ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
ReturnToService=2
#
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# LOGGING AND ACCOUNTING
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu,shard
NodeName=ensicompute-1 Gres=gpu:Tesla:1,shard:3 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht
NodeName=ensicompute-13 Gres=gpu:A40:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht
NodeName=ensicompute-[11-12] Gres=gpu:Quadro:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP
=== gres.conf ===
AutoDetect=off
# ensicompute-1
NodeName=ensicompute-1 Name=gpu Type=Tesla File=/dev/nvidia0
NodeName=ensicompute-1 Name=shard Count=3 File=/dev/nvidia0
# ensicompute-11
NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia0
NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia1
NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia2
NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia0
NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia1
NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia2
# ensicompute-12
NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia0
NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia1
NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia2
NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia0
NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia1
NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia2
# ensicompute-13
NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia0
NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia1
NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia2
NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia0
NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia1
NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia2
--
François Broquedis, Engineer, IT Services
Grenoble INP - Ensimag, office E208
681 rue de la Passerelle
BP 72, 38402 Saint Martin d'Hères CEDEX
Tel.: +33 (0)4 76 82 72 78