[slurm-users] GPU / cgroup challenges
kmanalo at jhu.edu
Tue May 1 17:00:36 MDT 2018
Having recently set this up, here was my test: when you make a single-GPU request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty bash), you should only see the GPU assigned to you via 'nvidia-smi'.
When gres is unset, you should see "No devices were found". Otherwise, if you ask for 1 of 2 GPUs, you should see only 1 device.
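For example, a quick check from an interactive job might look like this (a sketch; partition and other site-specific options are omitted, and it assumes device confinement is working on a two-GPU node):

    $ salloc --gres=gpu:1 srun --pty bash
    $ nvidia-smi    # should list exactly one GPU
    $ exit
    $ salloc srun --pty bash
    $ nvidia-smi    # no gres requested: "No devices were found"
    $ exit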
Also, I recall appending this to the bottom of [...]
Same as yours.
There was a Slurm bug tracker issue that made this clear; it is not spelled out so clearly in the website docs.
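(For reference, device confinement is controlled by cgroup.conf; a minimal sketch of the relevant settings follows. This is illustrative, not necessarily the configuration used here, and the allowed-devices path shown is the common default rather than a confirmed value.)

    ### cgroup.conf (illustrative)
    CgroupAutomount=yes
    ConstrainDevices=yes
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

With ConstrainDevices=yes, the task/cgroup plugin places each job in a devices cgroup that permits only the GPUs assigned via gres, plus whatever the allowed-devices file grants to all jobs.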
On 5/1/18, 5:28 PM, "slurm-users on behalf of R. Paul Wiegand" <slurm-users-bounces at lists.schedmd.com on behalf of rpwiegand at gmail.com> wrote:
I am setting up our new GPU cluster, and I seem to have a problem
configuring things so that the devices are properly walled off via
cgroups. Our nodes each have two GPUs; however, if --gres is unset, or
set to --gres=gpu:0, I can access both GPUs from inside a job.
Moreover, if I ask for just 1 GPU and then unset the CUDA_VISIBLE_DEVICES
environment variable, I can access both GPUs. From my
understanding, this suggests that the devices are *not* being protected under cgroups.
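(To make the failure mode concrete, a sketch of the behavior described above, assuming two GPUs per node:)

    $ salloc --gres=gpu:1 srun --pty bash
    $ echo $CUDA_VISIBLE_DEVICES    # e.g. 0
    $ unset CUDA_VISIBLE_DEVICES
    $ nvidia-smi    # both GPUs visible here, so cgroups are not enforcing the limit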
I've read the documentation, and I've read through a number of threads
where people have resolved similar issues. I've tried a lot of
configurations, but to no avail. Below I include some snippets of
relevant (current) parameters; however, I am also attaching most of
our full conf files.
From slurm.conf:

NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2
From gres.conf:

NodeName=evc[1-10] Name=gpu File=/dev/nvidia0
NodeName=evc[1-10] Name=gpu File=/dev/nvidia1
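(For completeness: device confinement also requires the cgroup plugins to be active in slurm.conf. A minimal sketch of the settings usually involved; these are standard parameter names, though the actual values used here are in the attached files:)

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

together with ConstrainDevices=yes in cgroup.conf.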