[slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

Matthias Loose m.loose at mindcode.de
Thu Jan 18 12:31:30 UTC 2024


Hi Hafedh,

I'm no expert on the GPU side of Slurm, but looking at your current 
configuration, it seems to me that it is working as intended: you have 
defined 4 GPUs, and every job you submit asks for all 4 of them, so 
each job has to wait for the resources to become free again.
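
If you simply want several of these jobs to run side by side, the most 
direct fix is to request fewer GPUs per job. As a minimal sketch (based 
on the script you posted, untested on my side), asking for a single GPU 
per job would let up to four of them run at once:

#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                 # one GPU per job instead of all four
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j

hostname
date
sleep 40
pwd

Note that --gpus-per-node=4 and --gres=gpu:4 in your script each ask 
for all four devices on the node, so either one of them alone is enough 
to serialize the jobs.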

If, on the other hand, you want multiple jobs to share the same GPUs 
concurrently, I think what you need to look into is the MPS plugin 
(gres/mps), which seems to do what you are trying to achieve:
https://slurm.schedmd.com/gres.html#MPS_Management
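
As a rough, untested sketch (please double-check the exact syntax 
against the gres.conf man page), enabling MPS on your node could look 
something like this:

# slurm.conf: add mps to GresTypes and to the node's Gres list
GresTypes=gpu,mps
NodeName=c-a100-cn01 Gres=gpu:A100:4,mps:400 CPUs=64 ...

# gres.conf: 400 MPS shares spread evenly over the 4 GPUs (100 per GPU)
NodeName=c-a100-cn01 AutoDetect=nvml Name=gpu Type=A100 File=/dev/nvidia[0-3]
NodeName=c-a100-cn01 Name=mps Count=400

A job would then request a fraction of a GPU instead of whole devices, 
e.g. with "#SBATCH --gres=mps:25" (a quarter of one GPU), and several 
such jobs can run on the same GPU at the same time. As far as I know, a 
job using gres/mps is limited to a single GPU.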

Kind regards,
   Matt

On 2024-01-18 12:53, Kherfani, Hafedh (Professional Services, TC) wrote:
> Hello Experts,
> 
> I'm a new Slurm user (so please bear with me :)  ...).
> Recently we've deployed Slurm version 23.11 on a very simple cluster,
> which consists of a Master node (acting as a Login and Slurmdbd node
> as well), a Compute node with an NVIDIA HGX A100-SXM4-40GB GPU,
> detected as 4 GPUs (GPU [0-3]), and a Storage array
> presenting/sharing the NFS disk (where users' home directories will be
> created as well).
> 
> The problem is that I've never been able to run a simple/dummy batch
> script in parallel across the 4 GPUs. In fact, running the same
> command "sbatch gpu-job.sh" multiple times shows that only a single
> job is running, while the other jobs are in a pending state:
> 
> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
> Submitted batch job 214
> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
> Submitted batch job 215
> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
> Submitted batch job 216
> [slurmtest@c-a100-master test-batch-scripts]$ squeue
>              JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
>                216       gpu  gpu-job slurmtest PD       0:00      1 (None)
>                215       gpu  gpu-job slurmtest PD       0:00      1 (Priority)
>                214       gpu  gpu-job slurmtest PD       0:00      1 (Priority)
>                213       gpu  gpu-job slurmtest PD       0:00      1 (Priority)
>                212       gpu  gpu-job slurmtest PD       0:00      1 (Resources)
>                211       gpu  gpu-job slurmtest  R       0:14      1 c-a100-cn01
> 
> PS: CPU jobs (i.e. jobs using the default debug partition, without
> requesting the GPU Gres) can run in parallel. The issue with running
> parallel jobs is only seen when using the GPUs as a Gres.
> 
> I've tried many combinations of settings in gres.conf and slurm.conf;
> many (if not most) of these combinations resulted in error messages
> in the slurmctld and slurmd logs.
> 
> The current gres.conf and slurm.conf contents are shown below. This
> configuration doesn't produce errors when restarting the slurmctld
> and slurmd services (on the master and compute nodes, respectively),
> but, as I said, it doesn't allow jobs to be executed in parallel. The
> batch script contents are shared below as well, to give more clarity
> on what I'm trying to do:
> 
> [root@c-a100-master slurm]# cat gres.conf | grep -v "^#"
> NodeName=c-a100-cn01 AutoDetect=nvml Name=gpu Type=A100 File=/dev/nvidia[0-3]
> 
> [root@c-a100-master slurm]# cat slurm.conf | grep -v "^#" | egrep -i "AccountingStorageTRES|GresTypes|NodeName|partition"
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> NodeName=c-a100-cn01 Gres=gpu:A100:4 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515181 State=UNKNOWN
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
> PartitionName=gpu Nodes=ALL MaxTime=10:0:0
> 
> [slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
> #!/bin/bash
> #SBATCH --job-name=gpu-job
> #SBATCH --partition=gpu
> #SBATCH --nodes=1
> #SBATCH --gpus-per-node=4
> #SBATCH --gres=gpu:4
> #SBATCH --tasks-per-node=1
> #SBATCH --output=gpu_job_output.%j   # Output file name (replaces %j with job ID)
> #SBATCH --error=gpu_job_error.%j     # Error file name (replaces %j with job ID)
> 
> hostname
> date
> sleep 40
> pwd
> 
> 
> Any idea which changes need to be made to the config files (mainly
> slurm.conf and gres.conf) and/or the batch script, so that multiple
> jobs can be in a "Running" state at the same time (in parallel)?
> 
> Thanks in advance for your help!
> 
> 
> Best regards,
> 
> Hafedh Kherfani


