[slurm-users] AutoDetect=nvml throwing an error message

Cristóbal Navarro cristobal.navarro.g at gmail.com
Wed Apr 14 21:51:28 UTC 2021


Typing error in my message below, it should read --> **located at /usr/include/nvml.h**

On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <
cristobal.navarro.g at gmail.com> wrote:

> Hi community,
> I have set up the configuration files as described in the documentation,
> but slurmd on the GPU compute node fails with the error shown in the log
> below.
> After reading the Slurm documentation, it is still not clear to me how to
> properly set up GPU autodetection in gres.conf, since it does not say
> whether NVML detection should happen automatically or needs extra setup.
> I have also read the top Google results, including
> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html
> but that one was caused by an overwritten CUDA installation (not my case).
> This is a DGX A100 node that comes with the NVIDIA driver installed, and
> nvml is located at /etc/include/nvml.h; I am not sure whether there is a
> libnvml.so or similar as well.
> How do I tell Slurm to look at those paths? Any ideas or shared experience
> are welcome.
> best
>
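> For what it is worth, this is how I plan to check whether the NVML library
> and header are actually present on the node (the paths below are guesses on
> my part, not verified against the DGX OS image):
>
> # NVML runtime library shipped with the NVIDIA driver
> ldconfig -p | grep -i nvidia-ml     # expecting libnvidia-ml.so.1
> # NVML development header (from the CUDA toolkit or a driver dev package)
> ls /usr/include/nvml.h /usr/local/cuda/include/nvml.h 2>/dev/null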
>
> *slurmd.log (GPU node)*
> [2021-04-14T17:31:42.302] got shutdown request
> [2021-04-14T17:31:42.302] all threads complete
> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open
> '(null)/tasks' for reading : No such file or directory
> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of
> '(null)'
> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open
> '(null)/tasks' for reading : No such file or directory
> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of
> '(null)'
> [2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
> [2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini: unloading GPU
> Generic plugin
> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres
> shutting down ...
> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature
> plugin unloaded
> [2021-04-14T17:31:42.304] Slurmd shutdown completing
> [2021-04-14T17:31:42.321] debug:  Log file re-opened
> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
> [2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8
> CoresPerSocket:16 ThreadsPerCore:2
> [2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file
> /etc/slurm/cgroup.conf
> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file
> (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
> [2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8
> CoresPerSocket:16 ThreadsPerCore:2
> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
> [2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
> *[2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml
> functionality, but we weren't able to find that lib when Slurm was
> configured.*
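>
> If I read that message correctly, NVML support has to be visible when Slurm
> itself is built, so my rough plan (untested; the plugin name, its location
> and the configure option below are assumptions on my part) is:
>
> # check whether the installed build already contains the NVML GPU plugin
> ls "$(scontrol show config | awk '/PluginDir/ {print $3}')"/gpu_nvml.so
> # if it is missing, install the NVML development files (nvml.h plus
> # libnvidia-ml) and rebuild Slurm so configure can pick them up; I believe
> # configure also accepts a path hint, something like:
> ./configure --with-nvml=/usr/local/cuda && make && make install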
>
>
>
>
> *gres.conf (just AutoDetect=nvml)*
> ➜  ~ cat /etc/slurm/gres.conf
> # GRES configuration for native GPUs
> # DGX A100 8x Nvidia A100
> # not working, slurm cannot find nvml
> AutoDetect=nvml
> #Name=gpu File=/dev/nvidia[0-7]
> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
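>
> Once a build with NVML support is in place, I intend to check what slurmd
> actually detects on the node before falling back to the manual lines above
> (run as root on nodeGPU01):
>
> # print the GRES that slurmd autodetects, then exit
> slurmd -G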
>
>
> *slurm.conf*
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
>
> ## We don't want a node to go back in pool without sys admin acknowledgement
> ReturnToService=0
>
> ## Basic scheduling
> #SelectType=select/cons_res
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SchedulerType=sched/backfill
>
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01  Default=YES
> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
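>
> And from the controller side I would expect to confirm that the GPUs got
> registered with something along these lines (the sinfo format string is
> just what I would try):
>
> # after restarting slurmctld and slurmd
> scontrol show node nodeGPU01 | grep -i gres
> sinfo -N -o "%N %G"    # %G lists the generic resources per node
>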
> --
> Cristóbal A. Navarro
>


-- 
Cristóbal A. Navarro