[slurm-users] AutoDetect=nvml throwing an error message

Cristóbal Navarro cristobal.navarro.g at gmail.com
Wed Apr 14 21:47:37 UTC 2021


Hi community,
I have set up the configuration files as described in the documentation, but
slurmd on the GPU compute node fails with the error shown in the log below.
After reading the Slurm documentation, it is still not entirely clear to me
how to properly set up GPU autodetection in gres.conf, since it does not say
whether NVML detection should work automatically or not.
I have also read the top Google results, including
https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html,
but that was a case of a CUDA installation being overwritten, which is not my situation.
This is a DGX A100 node that comes with the NVIDIA driver installed, and
nvml.h is located at /etc/include/nvml.h; I am not sure whether there is a
libnvml.so or similar as well.
How do I tell Slurm to look at those paths? Any ideas or shared experience
would be welcome.
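
From the output of Slurm's configure --help I understand there is a
--with-nvml option, so my guess is that slurmd has to be rebuilt on a machine
where the NVML library is visible at configure time. Something like the
following is what I was planning to try (the paths are assumptions based on a
standard driver/CUDA install, not verified on this node):

# check that the driver's NVML library is visible to the linker
ldconfig -p | grep nvidia-ml
# rebuild Slurm, pointing configure at the NVML installation prefix (path assumed)
./configure --with-nvml=/usr/local/cuda
make && make install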
Best


*slurmd.log (GPU node)*
[2021-04-14T17:31:42.302] got shutdown request
[2021-04-14T17:31:42.302] all threads complete
[2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open
'(null)/tasks' for reading : No such file or directory
[2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of
'(null)'
[2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open
'(null)/tasks' for reading : No such file or directory
[2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of
'(null)'
[2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
[2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini: unloading GPU
Generic plugin
[2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres
shutting down ...
[2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
[2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature
plugin unloaded
[2021-04-14T17:31:42.304] Slurmd shutdown completing
[2021-04-14T17:31:42.321] debug:  Log file re-opened
[2021-04-14T17:31:42.321] debug2: hwloc_topology_init
[2021-04-14T17:31:42.321] debug2: hwloc_topology_load
[2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
[2021-04-14T17:31:42.446] Considering each NUMA node as a socket
[2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8
CoresPerSocket:16 ThreadsPerCore:2
[2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file
/etc/slurm/cgroup.conf
[2021-04-14T17:31:42.447] debug2: hwloc_topology_init
[2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file
(/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
[2021-04-14T17:31:42.448] Considering each NUMA node as a socket
[2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8
CoresPerSocket:16 ThreadsPerCore:2
[2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
[2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
*[2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml
functionality, but we weren't able to find that lib when Slurm was
configured.*
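
In case it is relevant, this is what I was going to check next to confirm
whether the installed build actually has NVML support (the plugin path below
is an assumption based on a default install prefix; I have not confirmed it
on this node):

# the gpu/nvml plugin should only exist if configure found NVML at build time
ls /usr/lib/slurm/gpu_nvml.so
# and, if present, it should link against the driver's NVML library
ldd /usr/lib/slurm/gpu_nvml.so | grep nvidia-ml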




*gres.conf (just AutoDetect=nvml)*
➜  ~ cat /etc/slurm/gres.conf
# GRES configuration for native GPUS
# DGX A100 8x Nvidia A100
# not working, slurm cannot find nvml
AutoDetect=nvml
#Name=gpu File=/dev/nvidia[0-7]
#Name=gpu Type=A100 File=/dev/nvidia[0-7]
#Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
#Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
#Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
#Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
#Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
#Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
#Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
#Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63


*slurm.conf*
GresTypes=gpu
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres

## We don't want a node to go back in pool without sys admin acknowledgement
ReturnToService=0

## Basic scheduling
#SelectType=select/cons_res
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SchedulerType=sched/backfill

TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu

## Partitions list
PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
-- 
Cristóbal A. Navarro