[slurm-users] Bug in gres plugin

Vladimir Goy vovagoy at gmail.com
Tue May 22 00:26:50 MDT 2018


Dear Danny Auble, developers, and users,

After updating SLURM from version 17.02.2 to 17.11.6, the behavior of the
gres plugin has changed.

With version 17.02.2, gres.conf could be:
Name=gpu Type=K40 File=/dev/nvidia0   COREs=0
Name=gpu Type=K40 File=/dev/nvidia1   COREs=10
Name=gpu Type=cpu                     COREs=2-9,12-19 Count=16
Name=gpu Type=debugcpu                COREs=1,11      Count=2

All GPU jobs started successfully with slurm-17.02.2.
But slurm-17.11.6 no longer sets the CUDA_VISIBLE_DEVICES variable, so all
jobs on the same node use only one GPU. This is caused by an error raised in
the function common_gres_set_env(...) in file
src/plugins/gres/common/gres_common.c:196,
where these two values differ:
len = bit_size(bit_alloc); //Equal to 20
list_count(gres_devices) is equal to 2
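
The two numbers can be reproduced from the gres.conf shown earlier. This is a minimal sketch, not Slurm's code: struct gres_line, posted_conf and the three functions are hypothetical names used only for illustration. It assumes (not confirmed here) that the bitmap size equals the sum of Count over all Name=gpu lines, with a line that names one File counting as 1, and that the device list holds one entry per File= line.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one gres.conf line (not a Slurm type). */
struct gres_line {
    const char *type;  /* Type= value */
    const char *file;  /* File= value, or NULL when the line has none */
    int count;         /* Count= value; assumed 1 for a line with one File */
};

/* The four Name=gpu lines from the gres.conf in this message. */
const struct gres_line posted_conf[] = {
    { "K40",      "/dev/nvidia0", 1  },
    { "K40",      "/dev/nvidia1", 1  },
    { "cpu",      NULL,           16 },
    { "debugcpu", NULL,           2  },
};
const size_t posted_conf_len = sizeof(posted_conf) / sizeof(posted_conf[0]);

/* Sum of Count over all lines; assumed to correspond to bit_size(bit_alloc). */
int total_gres_count(const struct gres_line *lines, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += lines[i].count;
    return total;
}

/* Number of lines naming a device file; assumed to correspond to
 * list_count(gres_devices). */
int device_file_count(const struct gres_line *lines, size_t n)
{
    int devices = 0;
    for (size_t i = 0; i < n; i++)
        if (lines[i].file != NULL)
            devices++;
    return devices;
}

/* True only when every counted resource is backed by a device file. */
bool sizes_match(const struct gres_line *lines, size_t n)
{
    return total_gres_count(lines, n) == device_file_count(lines, n);
}
```

For posted_conf, total_gres_count gives 1 + 1 + 16 + 2 = 20 and device_file_count gives 2, so sizes_match returns false. Under the assumptions above, the two file-less lines (Type=cpu and Type=debugcpu) account for the whole difference.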

Why does this gres.conf no longer work? I use it to make sure that
COREs=0,10 are used only for GPU jobs and never for tasks without a GPU.

In general, the bug could be in
src/plugins/gres/common/gres_common.c
src/common/gres.c

I do not think I can fix this problem by myself.
Does anyone know a solution to this problem?

Best regards, Vova.