[slurm-users] Building Slurm RPMs with NVIDIA GPU support?
Paul Raines
raines at nmr.mgh.harvard.edu
Tue Jan 26 21:11:48 UTC 2021
Yes, you need to check inside the job.
This was a while ago now, but I am fairly sure I remember that although,
from the SLURM accounting aspect, the jobs were being assigned GPUs fine
(as you would see in 'scontrol show job' or 'sacct --job'), the
CUDA_VISIBLE_DEVICES environment variable was not being set, so all jobs
on the node saw (and possibly used) all GPUs and would get into conflicts.
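For example, what the scheduler assigned shows up in commands like these
(<jobid> here is just a placeholder):
login$ scontrol -d show job <jobid> | grep -i gres
login$ sacct -j <jobid> --format=JobID,AllocTRES%40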
You could see the missing CUDA_VISIBLE_DEVICES by doing something like
login$ srun --ntasks-per-node=1 --cpus-per-task=4 --gpus 2 --pty /bin/bash
node$ echo $CUDA_VISIBLE_DEVICES
and get nothing.  After installing the proper RPMs with NVML support on
the GPU node and running the test again, I get
login$ srun --ntasks-per-node=1 --cpus-per-task=4 --gpus 2 --pty /bin/bash
node$ echo $CUDA_VISIBLE_DEVICES
0,1
If CUDA_VISIBLE_DEVICES is not being set, also search for error messages
in your slurmd log file on the node.
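For example, something along these lines (the log path is whatever
SlurmdLogFile points to in slurm.conf; /var/log/slurmd.log is just a
common default):
node$ grep -iE 'gres|gpu|nvml' /var/log/slurmd.log
node$ slurmd -G
The second command prints the GRES (GPU) devices slurmd detects and then
exits, which is a quick way to confirm the GPU detection is working.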
This also probably requires you to have
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
GresTypes=gpu
in slurm.conf like I do.
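For NVML autodetection you also normally want a gres.conf on the GPU node;
a minimal sketch (it only does anything if slurmd was actually built
against the NVIDIA NVML library) is:
# gres.conf (alongside slurm.conf)
AutoDetect=nvml
With that, slurmd discovers the GPUs and their device files itself instead
of you listing each File=/dev/nvidiaN line by hand.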
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Tue, 26 Jan 2021 3:40pm, Ole Holm Nielsen wrote:
> Thanks Paul!
>
> On 26-01-2021 21:11, Paul Raines wrote:
>> You should check your jobs that allocated GPUs and make sure
>> CUDA_VISIBLE_DEVICES is being set in the environment.  If it is not,
>> that is a sign your GPU support is not really there and SLURM is just
>> doing "generic" resource assignment.
>
> Could you elaborate a bit on this remark? Are you saying that I need to
> check if CUDA_VISIBLE_DEVICES is defined automatically by Slurm inside the
> batch job as described in https://slurm.schedmd.com/gres.html?
>
> What do you mean by "your GPU support is not really there" and Slurm doing
> "generic" resource assignment? I'm just not understanding this...
>
> With my Slurm 20.02.6 built without NVIDIA libraries, Slurm nevertheless
> seems to be scheduling multiple jobs so that different jobs are assigned to
> different GPUs. The GRES=gpu* values point to distinct IDX values (GPU
> indexes). The nvidia-smi command shows individual processes running on
> distinct GPUs. All seems to be fine - or am I completely mistaken?
>
> Thanks,
> Ole