Hello Experts,
I'm a new Slurm user (so please bear with me :) ...). We recently deployed Slurm 23.11 on a very simple cluster, which consists of a master node (also acting as the login and slurmdbd node), a compute node with an NVIDIA HGX A100-SXM4-40GB, detected as 4 GPUs (GPU [0-3]), and a storage array presenting/sharing the NFS disk where users' home directories are created.
The problem is that I've never been able to run a simple/dummy batch script in parallel across the 4 GPUs. In fact, running the same command "sbatch gpu-job.sh" multiple times shows that only a single job is running, while the other jobs stay in a pending state:
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 214
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 215
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 216
[slurmtest@c-a100-master test-batch-scripts]$ squeue
  JOBID PARTITION     NAME      USER ST  TIME NODES NODELIST(REASON)
    216       gpu  gpu-job slurmtest PD  0:00     1 (None)
    215       gpu  gpu-job slurmtest PD  0:00     1 (Priority)
    214       gpu  gpu-job slurmtest PD  0:00     1 (Priority)
    213       gpu  gpu-job slurmtest PD  0:00     1 (Priority)
    212       gpu  gpu-job slurmtest PD  0:00     1 (Resources)
    211       gpu  gpu-job slurmtest  R  0:14     1 c-a100-cn01
PS: CPU jobs (i.e. using the default debug partition, without requesting the GPU GRES) can run in parallel. The issue with running parallel jobs is only seen when using the GPUs as a GRES.
I've tried many combinations of settings in gres.conf and slurm.conf; many (if not most) of these combinations resulted in error messages in the slurmctld and slurmd logs.
The current gres.conf and slurm.conf contents are shown below. This configuration doesn't produce errors when restarting the slurmctld and slurmd services (on the master and compute nodes, respectively), but as I said, it doesn't allow jobs to run in parallel. The batch script contents are shared below as well, to give more clarity on what I'm trying to do:
[root@c-a100-master slurm]# cat gres.conf | grep -v "^#"
NodeName=c-a100-cn01 AutoDetect=nvml Name=gpu Type=A100 File=/dev/nvidia[0-3]
[root@c-a100-master slurm]# cat slurm.conf | grep -v "^#" | egrep -i "AccountingStorageTRES|GresTypes|NodeName|partition"
GresTypes=gpu
AccountingStorageTRES=gres/gpu
NodeName=c-a100-cn01 Gres=gpu:A100:4 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515181 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=ALL MaxTime=10:0:0
[slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j   # Output file name (replaces %j with job ID)
#SBATCH --error=gpu_job_error.%j     # Error file name (replaces %j with job ID)
hostname
date
sleep 40
pwd
Could anyone point out which changes need to be made to the config files (mainly slurm.conf and gres.conf) and/or the batch script, so that multiple jobs can be in the "Running" state at the same time (in parallel)?
Thanks in advance for your help !
Best regards,
Hafedh Kherfani
Hi Hafedh,
I'm no expert on the GPU side of Slurm, but looking at your current configuration, to me it is working as intended at the moment. You have defined 4 GPUs and start multiple jobs that each consume all 4 GPUs, so the jobs wait for the resources to be free again.
I think what you need to look into is the MPS plugin, which seems to do what you are trying to achieve: https://slurm.schedmd.com/gres.html#MPS_Management
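For anyone curious, a rough sketch of the kind of MPS setup that page describes is below; the share counts are arbitrary placeholders and the linked documentation has the authoritative examples:

  # gres.conf on the compute node (sketch): the GPUs plus a pool of MPS shares
  Name=gpu Type=A100 File=/dev/nvidia[0-3]
  Name=mps Count=400                # 400 shares distributed over the 4 GPUs

  # slurm.conf would then also list mps in GresTypes and in the node definition, e.g.
  # GresTypes=gpu,mps
  # NodeName=c-a100-cn01 Gres=gpu:A100:4,mps:400 ...

  # and a job asks for a fraction of one GPU with something like:
  # sbatch --gres=mps:100 gpu-job.sh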
Kind regards, Matt
I agree with the first paragraph of Matthias's reply. How many GPUs are you expecting each job to use? I'd have assumed, based on the original text, that each job is supposed to use 1 GPU, and the 4 jobs were supposed to be running side-by-side on the one node you have (with 4 GPUs). If so, you need to tell each job to request only 1 GPU; currently each one is requesting 4.
If your jobs are actually supposed to be using 4 GPUs each, I still don't see any advantage to MPS (at least for my usual GPU usage pattern): all the jobs will take longer to finish, because they are sharing a fixed resource. If they take turns, at least the first ones finish as fast as they can, and the last one will finish no later than it would have if they were all time-sharing the GPUs. I assume NVIDIA had something in mind when they developed MPS, so our pattern may not be typical (or at least not universal), and in that case the MPS plugin may well be what you need.
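For reference, a minimal sketch of a one-GPU version of the script, reusing the partition and file names from the original post; the point is simply that every GPU-related directive requests 1 GPU rather than 4, so that four such jobs can run side by side on the node:

  #!/bin/bash
  #SBATCH --job-name=gpu-job
  #SBATCH --partition=gpu
  #SBATCH --nodes=1
  #SBATCH --gres=gpu:1                # one GPU per job, not 4
  #SBATCH --ntasks-per-node=1
  #SBATCH --output=gpu_job_output.%j
  #SBATCH --error=gpu_job_error.%j

  hostname
  date
  echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"   # optional sanity check: which GPU Slurm handed to this job
  sleep 40
  pwd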
Hi Noam and Matthias,
Thanks both for your answers.
I replaced the "#SBATCH --gres=gpu:4" directive (in the batch script) with "#SBATCH --gres=gpu:1" as you suggested, but it didn't make a difference: running this batch script 3 times still results in the first job being in a running state, while the second and third jobs stay in a pending state ...
[slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --gres=gpu:1              # <<<< Changed from '4' to '1'
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j
hostname
date
sleep 40
pwd
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 217
[slurmtest@c-a100-master test-batch-scripts]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    217       gpu  gpu-job slurmtes  R  0:02     1 c-a100-cn01
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 218
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 219
[slurmtest@c-a100-master test-batch-scripts]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    219       gpu  gpu-job slurmtes PD  0:00     1 (Priority)
    218       gpu  gpu-job slurmtes PD  0:00     1 (Resources)
    217       gpu  gpu-job slurmtes  R  0:07     1 c-a100-cn01
Basically I'm looking for some help/hints on how to tell Slurm, from the batch script for example, "I want only 1 or 2 GPUs to be used/consumed by this job", so that I can run the batch script a few times with the sbatch command and confirm that multiple jobs can each use a GPU and run in parallel at the same time.
Makes sense?
Best regards,
Hafedh
This line also has to be changed:
#SBATCH --gpus-per-node=4  ->  #SBATCH --gpus-per-node=1
--gpus-per-node seems to be the new parameter that is replacing the --gres= one, so you can remove the --gres line completely.
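Applied to the script from the earlier mail, the GPU request would then collapse to a single directive, along these lines (other directives unchanged):

  #SBATCH --nodes=1
  #SBATCH --gpus-per-node=1         # one GPU per job; the separate --gres=gpu:... line is removed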
Best Ümit
Hi
I'm not an expert, but is it possible that the currently running job is consuming the whole node because it has been allocated all of the node's memory (so the other 2 jobs have to wait until it finishes)? Maybe try restricting the required memory for each job?
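If that turns out to be the case, a memory request in the batch script would look something like the lines below; the 8G figure is only a placeholder, not a recommendation:

  #SBATCH --mem=8G              # cap the job's memory on the node
  # or, tied to the GPU request instead:
  #SBATCH --mem-per-gpu=8G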
Regards
+1 on checking the memory allocation. Also add/check whether you have any of the DefMemPer* options set in your slurm.conf.
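For reference, these are the slurm.conf defaults meant here; the values are placeholders, and normally only one of them is set:

  DefMemPerCPU=4096          # default MB per allocated CPU when a job does not request memory
  #DefMemPerGPU=40960        # or: default MB per allocated GPU
  #DefMemPerNode=515181      # or: default MB per node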
Also, remember to specify the memory used by the job, since memory is treated as a TRES if you're using one of the CR_*Memory options to select resources.
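The CR_*Memory variants mentioned here are set via SelectTypeParameters in slurm.conf; a sketch, assuming the consumable-TRES select plugin (which may or may not match this cluster's setup):

  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core_Memory    # memory becomes a consumable resource, so jobs should request it (e.g. --mem)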
Diego
Hi Hafedh,
Your job script has the sbatch directive "--gpus-per-node=4" set. I suspect that if you look at what's allocated to the running job by doing "scontrol show job <jobid>" and looking at the TRES field, it's been allocated 4 GPUs instead of one.
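Concretely, with the job IDs from the earlier mail, something like the following shows what the running job was actually given:

  scontrol show job 217 | grep -iE "tres|gres"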
Regards, --Troy
From: slurm-users slurm-users-bounces@lists.schedmd.com On Behalf Of Kherfani, Hafedh (Professional Services, TC) Sent: Thursday, January 18, 2024 9:38 AM To: Slurm User Community List slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
Hi Noam and Matthias, Thanks both for your answers. I changed the “#SBATCH --gres=gpu: 4“ directive (in the batch script) with “#SBATCH --gres=gpu: 1“ as you suggested, but it didn’t make a difference, as running
Hi Noam and Matthias,
Thanks both for your answers.
I changed the “#SBATCH --gres=gpu:4“ directive (in the batch script) with “#SBATCH --gres=gpu:1“ as you suggested, but it didn’t make a difference, as running this batch script 3 times will result in the first job to be in a running state, while the second and third jobs will still be in a pending state …
[slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh #!/bin/bash #SBATCH --job-name=gpu-job #SBATCH --partition=gpu #SBATCH --nodes=1 #SBATCH --gpus-per-node=4 #SBATCH --gres=gpu:1 # <<<< Changed from ‘4’ to ‘1’ #SBATCH --tasks-per-node=1 #SBATCH --output=gpu_job_output.%j #SBATCH --error=gpu_job_error.%j
hostname date sleep 40 pwd
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh Submitted batch job 217 [slurmtest@c-a100-master test-batch-scripts]$ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 217 gpu gpu-job slurmtes R 0:02 1 c-a100-cn01 [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh Submitted batch job 218 [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh Submitted batch job 219 [slurmtest@c-a100-master test-batch-scripts]$ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 219 gpu gpu-job slurmtes PD 0:00 1 (Priority) 218 gpu gpu-job slurmtes PD 0:00 1 (Resources) 217 gpu gpu-job slurmtes R 0:07 1 c-a100-cn01
Basically I’m seeking for some help/hints on how to tell Slurm, from the batch script for example: “I want only 1 or 2 GPUs to be used/consumed by the job”, and then I run the batch script/job a couple of times with sbatch command, and confirm that we can indeed have multiple jobs using a GPU and running in parallel, at the same time.
Makes sense ?
Best regards,
Hafedh
From: slurm-users <slurm-users-bounces@lists.schedmd.commailto:slurm-users-bounces@lists.schedmd.com> On Behalf Of Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) Sent: jeudi 18 janvier 2024 2:30 PM To: Slurm User Community List <slurm-users@lists.schedmd.commailto:slurm-users@lists.schedmd.com> Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
On Jan 18, 2024, at 7:31 AM, Matthias Loose <m.loose@mindcode.demailto:m.loose@mindcode.de> wrote:
Hi Hafedh,
Im no expert in the GPU side of SLURM, but looking at you current configuration to me its working as intended at the moment. You have defined 4 GPUs and start multiple jobs each consuming 4 GPUs each. So the jobs wait for the ressource the be free again.
I think what you need to look into is the MPS plugin, which seems to do what you are trying to achieve: https://slurm.schedmd.com/gres.html#MPS_Managementhttps://urldefense.com/v3/__https:/slurm.schedmd.com/gres.html*MPS_Management__;Iw!!KGKeukY!y8lBvIzVTUcjaJKXNVaSGxEyG-AgFP9NRgOW7uAUJNfWzKHN1Bc9YwXNuwlXGigW0JBn6IzA-XrgVsuHFf2E$
I agree with the first paragraph. How many GPUs are you expecting each job to use? I'd have assumed, based on the original text, that each job is supposed to use 1 GPU, and the 4 jobs were supposed to be running side-by-side on the one node you have (with 4 GPUs). If so, you need to tell each job to request only 1 GPU, and currently each one is requesting 4.
If your jobs are actually supposed to be using 4 GPUs each, I still don't see any advantage to MPS (at least in what is my usual GPU usage pattern): all the jobs will take longer to finish, because they are sharing the fixed resource. If they take turns, at least the first ones finish as fast as they can, and the last one will finish no later than it would have if they were all time-sharing the GPUs. I guess NVIDIA had something in mind when they developed MPS, so I guess our pattern may not be typical (or at least not universal), and in that case the MPS plugin may well be what you need.
Hi Ümit, Troy,
I removed the line "#SBATCH --gres=gpu:1" and changed the sbatch directive "--gpus-per-node=4" to "--gpus-per-node=1", but I'm still getting the same result: when running multiple sbatch commands for the same script, only one job (the first submission) runs, and all subsequent jobs stay in a pending state (the REASON is reported as "Resources" for the next job in the queue and "Priority" for the remaining ones) ...
As for the output of the "scontrol show job <jobid>" command: I don't see a "TRES" field on its own, but I do see the field "TresPerNode=gres/gpu:1" (the value at the end of that line corresponds to the value specified in the "--gpus-per-node=" directive).
PS: Is it normal/expected (in the output of the scontrol show job command) to have "Features=(null)"? I was expecting to see Features=gpu ...
Best regards,
Hafedh
Maybe also post the output of scontrol show job <jobid> to check the other resources allocated for the job.