[slurm-users] GPU / cgroup challenges
R. Paul Wiegand
rpwiegand at gmail.com
Mon May 21 05:17:19 MDT 2018
I am following up on this to first thank everyone for their suggestions and also to let you know that, indeed, upgrading from 17.11.0 to 17.11.6 solved the problem. Our GPUs are now properly walled off via cgroups per our existing config.
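For anyone hitting the same issue, device confinement of this kind is typically driven by Slurm's cgroup and GRES configuration. A minimal sketch follows; the device paths and node name are illustrative assumptions, not taken from the poster's actual config:

```
# cgroup.conf -- enable the devices cgroup controller
ConstrainDevices=yes

# gres.conf -- map each GPU GRES to its device file (paths assumed)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

# slurm.conf -- declare the GRES type and the per-node GPU count
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:2
```

With ConstrainDevices=yes, slurmd places each job step in a devices cgroup that denies access to any GPU not allocated to the job, which is what produces the "Allowing/Not allowing access to device" debug lines quoted below.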
> On May 5, 2018, at 9:04 AM, Chris Samuel <chris at csamuel.org> wrote:
> On Wednesday, 2 May 2018 11:04:34 PM AEST R. Paul Wiegand wrote:
>> When I set "--gres=gpu:1", the slurmd log does have encouraging lines such
>> [2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device
>> /dev/nvidia0 for job
>> [2018-05-02T08:47:04.916] [203.0] debug: Not allowing access to
>> device /dev/nvidia1 for job
>> However, I can still "see" both devices from nvidia-smi, and I can
>> still access both if I manually unset CUDA_VISIBLE_DEVICES.
> The only thing I can think of is a bug that's been fixed since 17.11.0 (as I
> know it works for us with 17.11.5) or a kernel bug (or missing device
> Sorry I can't be more helpful!
> All the best,
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
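With the fixed version in place, confinement can be checked from inside a job rather than by relying on CUDA_VISIBLE_DEVICES alone. A sketch (the exact device names printed will depend on the hardware):

```
# Request one GPU and list what the job step can actually see
srun --gres=gpu:1 nvidia-smi -L

# CUDA_VISIBLE_DEVICES only masks devices at the CUDA library level;
# with working cgroup confinement, unsetting it should still leave
# only the allocated GPU visible to nvidia-smi.
srun --gres=gpu:1 bash -c 'unset CUDA_VISIBLE_DEVICES; nvidia-smi -L'
```

If the second command still lists every GPU in the node, the devices cgroup is not actually being applied, as was the case on 17.11.0 here.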