[slurm-users] AutoDetect=nvml throwing an error message

Thu Apr 15 12:59:44 UTC 2021

The error message suggests that when you built Slurm from source, the
build could not find the NVML development packages.  If you look where
you installed Slurm, lib/slurm should contain a gpu_nvml.so.  Does it?
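
A quick way to run that check, as a rough sketch only: the snippet
tests for the plugin file under an install prefix.  The helper name
has_nvml_plugin and the default prefix /usr/local are assumptions made
for illustration, so set SLURM_PREFIX to wherever your build was
actually installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): succeeds if the NVML GPU
# plugin exists under the given install prefix.
has_nvml_plugin() {
    prefix="$1"
    [ -f "$prefix/lib/slurm/gpu_nvml.so" ]
}

# ASSUMPTION: /usr/local is only a placeholder default prefix.
SLURM_PREFIX="${SLURM_PREFIX:-/usr/local}"

if has_nvml_plugin "$SLURM_PREFIX"; then
    echo "gpu_nvml.so found under $SLURM_PREFIX/lib/slurm"
else
    echo "gpu_nvml.so missing under $SLURM_PREFIX/lib/slurm"
fi
```

If the file is missing, that would be consistent with the fatal message
in the log: the plugin was not built because NVML was not found when
Slurm was configured.  Some installs use lib64 instead of lib, so adjust
the path in the helper if that applies to your system.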

On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
<cristobal.navarro.g at gmail.com> wrote:
>
> Correction to a typo in my message: it should read **located at /usr/include/nvml.h**
>
> On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <cristobal.navarro.g at gmail.com> wrote:
>>
>> Hi community,
>> I have set up the configuration files as described in the documentation, but slurmd on the GPU compute node fails with the error shown in the log below.
>> After reading the Slurm documentation, it is still not clear to me how to set up GPU autodetection in gres.conf, as it does not say whether NVML detection should work automatically.
>> I have also read the top Google results, including https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html but that was a case of a CUDA installation being overwritten (not my case).
>> This is a DGX A100 node that comes with the NVIDIA driver installed, and nvml is located at /etc/include/nvml.h; I am not sure whether there is a libnvml.so or similar as well.
>> How do I tell Slurm to look at those paths? Any ideas or shared experience are welcome.
>> best
>>
>>
>> slurmd.log (GPU node)
>> [2021-04-14T17:31:42.302] got shutdown request
>> [2021-04-14T17:31:42.302] all threads complete
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
>> [2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
>> [2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini: unloading GPU Generic plugin
>> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres shutting down ...
>> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
>> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature plugin unloaded
>> [2021-04-14T17:31:42.304] Slurmd shutdown completing
>> [2021-04-14T17:31:42.321] debug:  Log file re-opened
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
>> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
>> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
>> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
>> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
>> [2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
>> [2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
>>
>>
>>
>> gres.conf (just AutoDetect=nvml)
>> ➜  ~ cat /etc/slurm/gres.conf
>> # GRES configuration for native GPUS
>> # DGX A100 8x Nvidia A100
>> # not working, slurm cannot find nvml
>> AutoDetect=nvml
>> #Name=gpu File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>
>>
>> slurm.conf
>> GresTypes=gpu
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>>
>> ## We don't want a node to go back in pool without sys admin acknowledgement
>> ReturnToService=0
>>
>> ## Basic scheduling
>> #SelectType=select/cons_res
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>> SchedulerType=sched/backfill
>>
>> TaskPlugin=task/cgroup
>> ProctrackType=proctrack/cgroup
>>
>> ## Nodes list
>> ## use native GPUs
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
>>
>> ## Partitions list
>> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01  Default=YES
>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>> --
>> Cristóbal A. Navarro
>
>
>
> --
> Cristóbal A. Navarro