[slurm-users] Building Slurm RPMs with NVIDIA GPU support?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Jan 26 20:40:10 UTC 2021


Thanks Paul!

On 26-01-2021 21:11, Paul Raines wrote:
> You should check your jobs that allocated GPUs and make sure
> CUDA_VISIBLE_DEVICES is being set in the environment.  This is a sign
> your GPU support is not really there but SLURM is just doing "generic"
> resource assignment.

Could you elaborate a bit on this remark?  Are you saying that I need to 
check whether CUDA_VISIBLE_DEVICES is set automatically by Slurm inside 
the batch job, as described in https://slurm.schedmd.com/gres.html?
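For testing, I suppose I could submit a minimal batch script along these 
lines (the job name, output file and gres string are just placeholders for 
our local setup):

#!/bin/bash
#SBATCH --job-name=gpu-env-check
#SBATCH --gres=gpu:1
#SBATCH --output=gpu-env-check.%j.out

# If Slurm's GPU GRES support is active, CUDA_VISIBLE_DEVICES should be
# set automatically for the job; otherwise it will be empty here.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<not set>}"
# SLURM_JOB_GPUS should also list the allocated GPU IDs, if this Slurm
# version sets it.
echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-<not set>}"
# List the GPUs visible to the job as the driver sees them.
nvidia-smi -L

Is that the kind of check you have in mind?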

What do you mean by "your GPU support is not really there" and Slurm 
doing "generic" resource assignment?  I'm just not understanding this...

With my Slurm 20.02.6 built without the NVIDIA libraries, Slurm 
nevertheless seems to schedule multiple jobs so that different jobs are 
assigned to different GPUs.  The GRES=gpu* values point to distinct IDX 
values (GPU indexes), and the nvidia-smi command shows the individual 
processes running on distinct GPUs.  All seems to be fine - or am I 
completely mistaken?
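For completeness, this is roughly how I have been checking the 
assignments (12345 is just a placeholder job ID):

# Per-job GPU index as Slurm reports it (GRES=... with IDX in the detailed view)
scontrol show job -d 12345 | grep -i gres

# Processes per physical GPU as the driver reports them
nvidia-smi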

Thanks,
Ole



