[slurm-users] [External] Re: serious bug about CUDA_VISIBLE_DEVICES in the slurm 17.11.7

Chris Samuel chris at csamuel.org
Fri Aug 31 19:35:40 MDT 2018


On Friday, 31 August 2018 1:48:33 AM AEST Chaofeng Zhang wrote:

> This result should be CUDA_VISIBLE_DEVICES=NoDevFiles, and it really is
> NoDevFiles in 17.02. So this must be a bug in 17.11.7.

Looking at git, it appears this code was refactored out of the GPU GRES plugin
and into some common GRES code for 17.11 in this commit:

commit 0e0cdd7d791ee48e5c4a44c307eea0d521ce91d0
Author: Danny Auble <da at schedmd.com>
Date:   Thu Oct 5 15:35:00 2017 -0600

    Convert the 3 different arrays used for devices in GRES into a nice structure.
    Not only that, but also make it so the slurmd sends this information over to
    the stepd on init.

    This also makes it so GRES of the same name and different types can happen.


If you have a support contract for Slurm, I would suggest opening a bug
with them about this change in behaviour; it feels unintended.

However, even with that fixed, nothing stops users from setting
CUDA_VISIBLE_DEVICES themselves and accessing GPUs they are not meant to.
You really do need to use cgroups to stop that happening.
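
To illustrate why the environment variable alone is not an enforcement
mechanism, here is a minimal Python sketch. It is not Slurm code: the helper
name child_sees and the device lists "0" and "0,1,2,3" are made up for the
example, and it only shows that a process can overwrite the variable before
launching its own children.

```python
import os
import subprocess
import sys


def child_sees(env):
    """Hypothetical helper: start a child process with the given
    environment and return the CUDA_VISIBLE_DEVICES value it observes."""
    probe = "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Assumption for the example: the scheduler granted this job GPU 0 only.
job_env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")

# The user overwrites the variable; nothing at this level prevents it.
user_env = dict(job_env, CUDA_VISIBLE_DEVICES="0,1,2,3")

if __name__ == "__main__":
    print("as set for the job:", child_sees(job_env))
    print("after user override:", child_sees(user_env))
```

The variable is only a hint to the CUDA runtime in that process. Blocking
access to the device files themselves is what the cgroup devices controller
does; in Slurm that is normally enabled with ConstrainDevices=yes in
cgroup.conf together with the task/cgroup plugin, but check the documentation
for your version.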

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC





