If you run "scontrol show jobid <jobid>" of your pending job with the "(Resources)" tag you may see more about what is unavailable to your job. Slurm default configs can cause an entire compute node of resources to be "allocated" to a running job regardless of whether it needs all of them or not so you may need to alter one or both of the following settings to allow more than one job to run on a single node at once. You'll find these in your slurm.conf. Don't forget to "scontrol reconf"…
[View More] and even potentially restart both "slurmctld" & "slurmd" on your nodes if you do end up making changes.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
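For reference, a minimal sketch of applying such a change (assuming systemd-managed daemons; paths and unit names may differ on your systems):

# edit slurm.conf on the controller and keep it in sync on the nodes, then:
scontrol reconfigure
# changing SelectType/SelectTypeParameters typically also needs a daemon restart:
systemctl restart slurmctld   # on the controller
systemctl restart slurmd      # on each compute node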
I hope this helps.
Kind regards,
Jason
----
Jason Macklin
Manager Cyberinfrastructure, Research Cyberinfrastructure
860.837.2142 t | 860.202.7779 m
jason.macklin(a)jax.org
The Jackson Laboratory
Maine | Connecticut | California | Shanghai
www.jax.org
The Jackson Laboratory: Leading the search for tomorrow's cures
________________________________
From: slurm-users <slurm-users-bounces(a)lists.schedmd.com> on behalf of slurm-users-request(a)lists.schedmd.com <slurm-users-request(a)lists.schedmd.com>
Sent: Thursday, January 18, 2024 9:46 AM
To: slurm-users(a)lists.schedmd.com <slurm-users(a)lists.schedmd.com>
Subject: [BULK] slurm-users Digest, Vol 75, Issue 26
Send slurm-users mailing list submissions to
slurm-users(a)lists.schedmd.com
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
or, via email, send a message with subject or body 'help' to
slurm-users-request(a)lists.schedmd.com
You can reach the person managing the list at
slurm-users-owner(a)lists.schedmd.com
When replying, please edit your Subject line so it is more specific
than "Re: Contents of slurm-users digest..."
Today's Topics:
1. Re: Need help with running multiple instances/executions of a
batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
(Baer, Troy)
----------------------------------------------------------------------
Message: 1
Date: Thu, 18 Jan 2024 14:46:48 +0000
From: "Baer, Troy" <troy(a)osc.edu>
To: Slurm User Community List <slurm-users(a)lists.schedmd.com>
Subject: Re: [slurm-users] Need help with running multiple
instances/executions of a batch script in parallel (with NVIDIA HGX
A100 GPU as a Gres)
Message-ID:
<CH0PR01MB6924127AF471DED69151805BCF712(a)CH0PR01MB6924.prod.exchangelabs.com>
Content-Type: text/plain; charset="utf-8"
Hi Hafedh,
Your job script has the sbatch directive "--gpus-per-node=4" set. I suspect that if you look at what's allocated to the running job by doing "scontrol show job <jobid>" and looking at the TRES field, it's been allocated 4 GPUs instead of one.
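A quick sketch of that check (standard commands; <jobid> is whatever squeue reports, and the exact TRES line varies by configuration):

scontrol show job <jobid> | grep -i tres
# something like TRES=cpu=1,mem=...,node=1,billing=1,gres/gpu=4 would confirm that 4 GPUs were allocated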
Regards,
--Troy
From: slurm-users <slurm-users-bounces(a)lists.schedmd.com> On Behalf Of Kherfani, Hafedh (Professional Services, TC)
Sent: Thursday, January 18, 2024 9:38 AM
To: Slurm User Community List <slurm-users(a)lists.schedmd.com>
Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
Hi Noam and Matthias,
Thanks both for your answers.
I changed the "#SBATCH --gres=gpu:4" directive in the batch script to "#SBATCH --gres=gpu:1" as you suggested, but it didn't make a difference: when I run this batch script 3 times, the first job goes into the running state while the second and third jobs remain pending.
[slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --gres=gpu:1 # <<<< Changed from "4" to "1"
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j
hostname
date
sleep 40
pwd
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 217
[slurmtest@c-a100-master test-batch-scripts]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
217 gpu gpu-job slurmtes R 0:02 1 c-a100-cn01
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 218
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 219
[slurmtest@c-a100-master test-batch-scripts]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
219 gpu gpu-job slurmtes PD 0:00 1 (Priority)
218 gpu gpu-job slurmtes PD 0:00 1 (Resources)
217 gpu gpu-job slurmtes R 0:07 1 c-a100-cn01
Basically, I'm looking for some help/hints on how to tell Slurm, from the batch script for example, "I want only 1 or 2 GPUs to be used by this job", so that I can run the batch script a couple of times with the sbatch command and confirm that we can indeed have multiple jobs each using a GPU and running in parallel at the same time.
Makes sense?
Best regards,
Hafedh
From: slurm-users <slurm-users-bounces(a)lists.schedmd.com> On Behalf Of Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Sent: jeudi 18 janvier 2024 2:30 PM
To: Slurm User Community List <slurm-users(a)lists.schedmd.com>
Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
On Jan 18, 2024, at 7:31 AM, Matthias Loose <m.loose(a)mindcode.de> wrote:
Hi Hafedh,
I'm no expert on the GPU side of Slurm, but looking at your current configuration, it seems to me it's working as intended at the moment. You have defined 4 GPUs and start multiple jobs, each consuming 4 GPUs. So the jobs wait for the resource to be free again.
I think what you need to look into is the MPS plugin, which seems to do what you are trying to achieve:
https://slurm.schedmd.com/gres.html#MPS_Management
I agree with the first paragraph. How many GPUs are you expecting each job to use? I'd have assumed, based on the original text, that each job is supposed to use 1 GPU, and the 4 jobs were supposed to be running side-by-side on the one node you have (with 4 GPUs). If so, you need to tell each job to request only 1 GPU, and currently each one is requesting 4.
If your jobs are actually supposed to be using 4 GPUs each, I still don't see any advantage to MPS (at least for my usual GPU usage pattern): all the jobs will take longer to finish, because they are sharing a fixed resource. If they take turns, at least the first ones finish as fast as they can, and the last one will finish no later than it would have if they were all time-sharing the GPUs. I guess NVIDIA had something in mind when they developed MPS, so our pattern may not be typical (or at least not universal), and in that case the MPS plugin may well be what you need.
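To illustrate the one-GPU-per-job case, a minimal sketch of the request lines (my reading of the intent; partition and output settings copied from the script above):

#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1           # single GPU; the --gpus-per-node=4 line is removed entirely
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j

With this, several such jobs should be able to share the 4-GPU node, provided the node itself can host more than one job at a time (see the SelectType/cons_tres advice at the top of this thread).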
Hello all,
Is there an environment variable in Slurm that tells the commands where slurm.conf is?
We would like to have, on the same client node, two possible types of submission addressing two different clusters.
Thanks in advance,
Christine
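A minimal sketch of the mechanism being asked about (SLURM_CONF is the environment variable the client commands honour; the paths below are made-up examples):

export SLURM_CONF=/etc/slurm/clusterA/slurm.conf            # hypothetical path for cluster A
sbatch job.sh                                               # submits to cluster A
SLURM_CONF=/etc/slurm/clusterB/slurm.conf sbatch job.sh     # one-off submission to cluster B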
Hi,
In my HPC center, I found a Slurm job that was submitted with --gres=gpu:6, whereas the cluster has only four GPUs per node. It is a parallel job. Here is a printout of some relevant fields:
AllocCPUS 30
AllocGRES gpu:6
AllocTRES billing=30,cpu=30,gres/gpu=6,node=3
CPUTime 1-01:23:00
CPUTimeRAW 91380
Elapsed 00:50:46
JobID 20073
JobIDRaw 20073
JobName simple_cuda
NCPUS 30
NGPUS 6.0
What happened in this case? This job was asking for 3 nodes, 10 core per node. When the user specified “--gres=gpu:6”, does this mean six GPUs for the entire job, or six GPUs per node? Per the description in https://slurm.schedmd.com/gres.html#Running_Jobs, it says: gres is “Generic resources required per node”. So it is illogical to request six GPUs per node. So what happened? Did SLURM quietly ignore the request and grant just one, or grant the max number (4)? Because apparently the job ran without error.
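For what it's worth, a sketch of how the actual GPU layout for that job could be checked (standard sacct/scontrol invocations; the field widths are just formatting):

sacct -j 20073 --format=JobID,NodeList%30,AllocTRES%60
scontrol -d show job 20073    # only while the job is still known to the controller

As far as I know, AllocTRES reports job-wide totals (note cpu=30 matches 10 cores x 3 nodes), so gres/gpu=6 would mean six GPUs for the job as a whole rather than six per node.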
Wirawan Purwanto
Computational Scientist, HPC Group
Information Technology Services
Old Dominion University
Norfolk, VA 23529
Dear All,
I tried to implement a strict limit on the GrpTRESMins for
each user. The effect I'm trying to achieve is that after the
limit of GPU minutes is reached, no new jobs can be run.
No decay, no automatic resource replenishment. After the
limit on GPU minutes is reached, each user should ask for
more minutes.
But despite exceeding the limits users *can* run new jobs.
* When I'm adding a user to the cluster I set:
sacctmgr --immediate add user name=...
...
QOS=2gpu2d
GrpTRESMins=gres/gpu=20000
* In the "slurm.conf" ("safe" means limits and associations
are automatically set). Storage is MariaDB with SlurmDBD:
GresTypes=gpu
AccountingStorageTRES=gres/gpu
AccountingStorageEnforce=qos,safe
# This disables GPU minutes replenishing.
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
But when I look at a user's account info and usage, you can
see that the limits are not enforced.
Account User Partition QOS GrpTRESMins
---------- ---------------- ------------ ------------ --------------------
redacted redacted a6000 2gpu2d gres/gpu=10000
--------------------------------------------------------------------------------
Top 1 Users 2024-01-05T00:00:00 - 2024-01-17T19:59:59 (1108800 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
Login Used TRES Name
------------ -------- ----------------
redacted 184311 gres/gpu
redacted 1558558 cpu
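As a cross-check, a sketch of commands (standard sacctmgr/scontrol, to the best of my knowledge) that show which limits the controller has actually loaded and what usage it has recorded against them; <user> is a placeholder:

sacctmgr show assoc where user=<user> format=cluster,account,user,qos,grptresmins
scontrol show assoc_mgr flags=assoc   # dumps the limits and usage slurmctld currently holds in memory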
Could someone explain, where could the problem be? Am I missing
something? Apparently yes :)
Kind regards
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
I would like to add a preemptable queue to our cluster. Actually I already
have. We simply want jobs submitted to that queue to be preempted if there are
no resources available for jobs in other (high priority) queues.
Conceptually very simple, no conditionals, no choices, just what I wrote.
However it does not work as desired.
This is the relevant part:
grep -i Preemp /opt/slurm/slurm.conf
#PreemptType = preempt/partition_prio
PartitionName=regular DefMemPerCPU=4580 Default=True Nodes=node[01-12]
State=UP PreemptMode=off PriorityTier=200
PartitionName=All DefMemPerCPU=4580 Nodes=node[01-36] State=UP
PreemptMode=off PriorityTier=500
PartitionName=lowpriority DefMemPerCPU=4580 Nodes=node[01-36] State=UP
PreemptMode=cancel PriorityTier=100
That PreemptType setting (now commented out) fully breaks Slurm; everything refuses to run, with errors like
$ squeue
squeue: error: PreemptType and PreemptMode values incompatible
squeue: fatal: Unable to process configuration file
If I understand the documentation at https://slurm.schedmd.com/preempt.html correctly, that is because preemption cannot cancel jobs based on partition priority, which (if true) is really unfortunate. I understand that allowing cross-partition time-slicing could be tricky, and so I understand why that isn't allowed, but cancelling?
Anyway, I have a few questions:
1) is that correct and so should I avoid using either partition priority or
cancelling?
2) is there an easy way to trick Slurm into requeueing those jobs and then having them cancelled instead?
3) I guess the cleanest option would be to implement QoS, but I've never
done it and we don't really need it for anything else other than this. The
documentation looks complicated, but is it? The great Ole's website is
unavailable at the moment...
Thanks!!
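For reference, a minimal sketch of a partition-priority setup in which the cluster-wide PreemptMode is not OFF, which (going by the error above and preempt.html) appears to be what preempt/partition_prio requires; the values are illustrative, not a verified fix:

PreemptType=preempt/partition_prio
PreemptMode=CANCEL    # cluster-wide default; must not be OFF when PreemptType is set
PartitionName=regular Nodes=node[01-12] PriorityTier=200 PreemptMode=off          # other options as before
PartitionName=lowpriority Nodes=node[01-36] PriorityTier=100 PreemptMode=cancel   # other options as before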
Yes, that makes sense. Thank you!
What am I misunderstanding about how sacct filtering works here? I would have expected the second command to show the exact same results as the first.
[root@mickey ddrucker]# sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") -X --format JobID,JobName,State,Elapsed --name zsh
JobID JobName State Elapsed
------------ ---------- ---------- ----------
257713 zsh COMPLETED 00:01:02
257714 zsh COMPLETED 00:04:01
257715 zsh COMPLETED 00:03:01
257716 zsh COMPLETED 00:03:01
[root@mickey ddrucker]# sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") -X --format JobID,JobName,State,Elapsed --name zsh --state COMPLETED
JobID JobName State Elapsed
------------ ---------- ---------- ----------
[root@mickey ddrucker]# sinfo --version
slurm 21.08.8-2
--
Daniel M. Drucker, Ph.D.
Director of IT, MGB Imaging at Belmont
McLean Hospital, a Harvard Medical School Affiliate
> All I can say is that this has to do with --starttime and that you have to read the manual really carefully about how they interact, including when you have --endtime set. It’s a bit fiddly and annoying, IMO, and I can never quite remember how it works.
Oh, I think I understand. --starttime actually behaves differently when --state is present:
If states are given with the '-s' option then only jobs in this state at this time will be returned.
So is there a way to do what I want? I want to see jobs which
- started later than 7 days ago
- whose state is COMPLETED
Surely that's possible without resorting to grep?
Daniel
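A sketch of the first thing I would try, based on the same man-page wording (unverified; the --state/--starttime interaction is exactly the fiddly part quoted above): give sacct an explicit end time as well, so the state filter applies across the whole window rather than at a single instant.

sacct -X --name zsh --state COMPLETED \
      --starttime $(date -d "7 days ago" +"%Y-%m-%d") --endtime now \
      --format JobID,JobName,State,Elapsed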
We have shuttered two clusters and need to remove them from the database. To do this, do we remove the table spaces associated with the cluster names from the Slurm database?
Thanks,
Jeff
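A minimal sketch of the sacctmgr route (rather than editing the MariaDB tables by hand), assuming the clusters are still registered in slurmdbd; the cluster names are placeholders:

sacctmgr show cluster                    # list the clusters slurmdbd knows about
sacctmgr delete cluster old_cluster_1    # removes that cluster from the accounting database
sacctmgr delete cluster old_cluster_2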