[slurm-users] GPU / cgroup challenges
kmanalo at jhu.edu
Tue May 1 17:00:36 MDT 2018
Having recently set this up, here was my test: when you make a single-GPU request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty bash), you should only see the GPU assigned to you via 'nvidia-smi'.
When gres is unset, you should see "No devices were found". Otherwise, if you ask for 1 of 2 GPUs, you should see only 1 device.
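For example, a quick check from an interactive job might look like this (a sketch; partition and other site-specific options are omitted, and it assumes device confinement is working on a two-GPU node):

    $ salloc --gres=gpu:1 srun --pty bash
    $ nvidia-smi    # should list exactly one GPU
    $ exit
    $ salloc srun --pty bash
    $ nvidia-smi    # no gres requested: "No devices were found"
    $ exit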
Also, I recall appending this to the bottom of [...]
Same as yours.
There was a Slurm bug tracker issue that made this clear; it is not spelled out so clearly in the website docs.
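(For reference, device confinement is controlled by cgroup.conf; a minimal sketch of the relevant settings follows. This is illustrative, not necessarily the configuration used here, and the allowed-devices path shown is the common default rather than a confirmed value.)

    ### cgroup.conf (illustrative)
    CgroupAutomount=yes
    ConstrainDevices=yes
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

With ConstrainDevices=yes, the task/cgroup plugin places each job in a devices cgroup that permits only the GPUs assigned via gres, plus whatever the allowed-devices file grants to all jobs.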
On 5/1/18, 5:28 PM, "slurm-users on behalf of R. Paul Wiegand" <slurm-users-bounces at lists.schedmd.com on behalf of rpwiegand at gmail.com> wrote:
I am setting up our new GPU cluster, and I seem to have a problem
configuring things so that the devices are properly walled off via
cgroups. Our nodes each have two GPUs; however, if --gres is unset, or
set to --gres=gpu:0, I can access both GPUs from inside a job.
Moreover, if I ask for just 1 GPU and then unset the CUDA_VISIBLE_DEVICES
environment variable, I can access both GPUs. From my
understanding, this suggests that the devices are *not* being protected under cgroups.
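(To make the failure mode concrete, a sketch of the behavior described above, assuming two GPUs per node:)

    $ salloc --gres=gpu:1 srun --pty bash
    $ echo $CUDA_VISIBLE_DEVICES    # e.g. 0
    $ unset CUDA_VISIBLE_DEVICES
    $ nvidia-smi    # both GPUs visible here, so cgroups are not enforcing the limit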
I've read the documentation, and I've read through a number of threads
where people have resolved similar issues. I've tried a lot of
configurations, but to no avail. Below I include some snippets of
relevant (current) parameters; however, I am also attaching most of
our full conf files.
From slurm.conf:

NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2
From gres.conf:

NodeName=evc[1-10] Name=gpu File=/dev/nvidia0
NodeName=evc[1-10] Name=gpu File=/dev/nvidia1
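(For completeness: device confinement also requires the cgroup plugins to be active in slurm.conf. A minimal sketch of the settings usually involved; these are standard parameter names, though the actual values used here are in the attached files:)

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

together with ConstrainDevices=yes in cgroup.conf.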