[slurm-users] AutoDetect=nvml throwing an error message

Stephan Roth stephan.roth at ee.ethz.ch
Fri Apr 16 10:18:19 UTC 2021


Hi Cristóbal

Under Debian Stretch/Buster I had to set 
LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current for configure to find 
the NVML shared library.
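For example, the rebuild looks roughly like this (a sketch only; the source 
directory, install prefix, and library path are examples that need to be 
adjusted to your installation, and on your Ubuntu node libnvidia-ml.so 
appears to live directly in /usr/lib/x86_64-linux-gnu):

    # rebuild Slurm so configure can find libnvidia-ml.so at build time
    cd slurm-<version>
    ./configure --prefix=/usr/local \
        LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current
    # add CPPFLAGS=-I<dir> too if nvml.h is not in a default include path
    make -j"$(nproc)" && sudo make install
    # the plugin directory should now contain gpu_nvml.so, e.g.:
    ls /usr/local/lib/slurm/gpu_nvml.so

After rebuilding and reinstalling, slurmd on the GPU node has to be 
restarted for AutoDetect=nvml to take effect.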

Best,
Stephan

On 15.04.21 19:46, Cristóbal Navarro wrote:
> Hi Michael,
> Thanks. Indeed, I don't have it, so Slurm must not have detected NVML when it was built.
> I double-checked and NVML is installed (libnvidia-ml-dev on Ubuntu).
> Here is some output, including the relevant paths for NVML.
> Is it possible to tell the Slurm compilation to check these paths for NVML?
> best
> 
> *NVML PKG CHECK*
> ➜  ~ sudo apt search nvml
> Sorting... Done
> Full Text Search... Done
> cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64
>    NVML native dev links, headers
> 
> cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64
>    NVML native dev links, headers
> 
> cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64
>    NVML native dev links, headers
> 
> *libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]
>    NVIDIA Management Library (NVML) development files*
> python3-pynvml/focal 7.352.0-3 amd64
>    Python3 bindings to the NVIDIA Management Library
> 
> 
> 
> *NVML Shared library location*
> ➜  ~ find /usr/lib | grep libnvidia-ml
> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04
> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
> 
> 
> 
> *NVML Header*
> ➜  ~ find /usr | grep nvml
> /usr/include/nvml.h
> 
> 
> 
> 
> *SLURM LIBS*
> ➜  ~ ls /usr/lib64/slurm/
> accounting_storage_mysql.so*        core_spec_none.so*             job_submit_pbs.so*                proctrack_pgid.so*
> accounting_storage_none.so*         cred_munge.so*                 job_submit_require_timelimit.so*  route_default.so*
> accounting_storage_slurmdbd.so*     cred_none.so*                  job_submit_throttle.so*           route_topology.so*
> acct_gather_energy_ibmaem.so*       ext_sensors_none.so*           launch_slurm.so*                  sched_backfill.so*
> acct_gather_energy_ipmi.so*         gpu_generic.so*                mcs_account.so*                   sched_builtin.so*
> acct_gather_energy_none.so*         gres_gpu.so*                   mcs_group.so*                     sched_hold.so*
> acct_gather_energy_pm_counters.so*  gres_mic.so*                   mcs_none.so*                      select_cons_res.so*
> acct_gather_energy_rapl.so*         gres_mps.so*                   mcs_user.so*                      select_cons_tres.so*
> acct_gather_energy_xcc.so*          gres_nic.so*                   mpi_none.so*                      select_linear.so*
> acct_gather_filesystem_lustre.so*   jobacct_gather_cgroup.so*      mpi_pmi2.so*                      site_factor_none.so*
> acct_gather_filesystem_none.so*     jobacct_gather_linux.so*       mpi_pmix.so@                      slurmctld_nonstop.so*
> acct_gather_interconnect_none.so*   jobacct_gather_none.so*        mpi_pmix_v2.so*                   src/
> acct_gather_interconnect_ofed.so*   jobcomp_elasticsearch.so*      node_features_knl_generic.so*     switch_none.so*
> acct_gather_profile_hdf5.so*        jobcomp_filetxt.so*            power_none.so*                    task_affinity.so*
> acct_gather_profile_influxdb.so*    jobcomp_lua.so*                preempt_none.so*                  task_cgroup.so*
> acct_gather_profile_none.so*        jobcomp_mysql.so*              preempt_partition_prio.so*        task_none.so*
> auth_munge.so*                      jobcomp_none.so*               preempt_qos.so*                   topology_3d_torus.so*
> burst_buffer_generic.so*            jobcomp_script.so*             prep_script.so*                   topology_hypercube.so*
> cli_filter_lua.so*                  job_container_cncu.so*         priority_basic.so*                topology_none.so*
> cli_filter_none.so*                 job_container_none.so*         priority_multifactor.so*          topology_tree.so*
> cli_filter_syslog.so*               job_submit_all_partitions.so*  proctrack_cgroup.so*
> cli_filter_user_defaults.so*        job_submit_lua.so*             proctrack_linuxproc.so*
> 
> On Thu, Apr 15, 2021 at 9:02 AM Michael Di Domenico
> <mdidomenico4 at gmail.com> wrote:
> 
>     The error message sounds like the NVML development package wasn't found
>     when you built the Slurm source. If you look where you installed Slurm,
>     in lib/slurm you should have a gpu_nvml.so. Do you?
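> 
>     A quick way to check (the plugin directory below is the one from your
>     listing; adjust the path if Slurm is installed elsewhere):
> 
>         # list the GPU-related plugins that were actually installed
>         ls /usr/lib64/slurm/ | grep -E 'gpu|nvml'
>         # if gpu_nvml.so is not in the output, Slurm was built without NVML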
> 
>     On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
>     <cristobal.navarro.g at gmail.com> wrote:
>      >
>      > Typo in my previous message; it should read **located at /usr/include/nvml.h**.
>      >
>      > On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro
>      > <cristobal.navarro.g at gmail.com> wrote:
>      >>
>      >> Hi community,
>      >> I have set up the configuration files as described in the
>     documentation, but slurmd on the GPU compute node fails with the
>     error shown in the log below.
>      >> After reading the Slurm documentation, it is not entirely clear
>     to me how to properly set up GPU autodetection in the gres.conf
>     file, as it does not say whether NVML detection should be automatic
>     or not.
>      >> I have also read the top Google results, including
>     https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html,
>     but that was a problem of a CUDA installation being overwritten (not
>     my case).
>      >> This is a DGX A100 node that comes with the NVIDIA driver
>     installed, and nvml is located at /etc/include/nvml.h; I am not sure
>     whether there is a libnvml.so or similar as well.
>      >> How can I tell Slurm to look at those paths? Any ideas or shared
>     experience are welcome.
>      >> best
>      >>
>      >>
>      >> slurmd.log (GPU node)
>      >> [2021-04-14T17:31:42.302] got shutdown request
>      >> [2021-04-14T17:31:42.302] all threads complete
>      >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to
>     open '(null)/tasks' for reading : No such file or directory
>      >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to
>     get pids of '(null)'
>      >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to
>     open '(null)/tasks' for reading : No such file or directory
>      >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to
>     get pids of '(null)'
>      >> [2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
>      >> [2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini:
>     unloading GPU Generic plugin
>      >> [2021-04-14T17:31:42.304] select/cons_tres: common_fini:
>     select/cons_tres shutting down ...
>      >> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so:
>     slurmd_exit = 0
>      >> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential
>     signature plugin unloaded
>      >> [2021-04-14T17:31:42.304] Slurmd shutdown completing
>      >> [2021-04-14T17:31:42.321] debug:  Log file re-opened
>      >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
>      >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
>      >> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
>      >> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
>      >> [2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8
>     CoresPerSocket:16 ThreadsPerCore:2
>      >> [2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file
>     /etc/slurm/cgroup.conf
>      >> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
>      >> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml
>     file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
>      >> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
>      >> [2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8
>     CoresPerSocket:16 ThreadsPerCore:2
>      >> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
>      >> [2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
>      >> [2021-04-14T17:31:42.449] fatal: We were configured to
>     autodetect nvml functionality, but we weren't able to find that lib
>     when Slurm was configured.
>      >>
>      >>
>      >>
>      >> gres.conf (just AutoDetect=nvml)
>      >> ➜  ~ cat /etc/slurm/gres.conf
>      >> # GRES configuration for native GPUS
>      >> # DGX A100 8x Nvidia A100
>      >> # not working, slurm cannot find nvml
>      >> AutoDetect=nvml
>      >> #Name=gpu File=/dev/nvidia[0-7]
>      >> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
>      >> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>      >> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>      >> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>      >> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>      >> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>      >> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>      >> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>      >> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>      >>
>      >>
>      >> slurm.conf
>      >> GresTypes=gpu
>      >> AccountingStorageTRES=gres/gpu
>      >> DebugFlags=CPU_Bind,gres
>      >>
>      >> ## We don't want a node to go back in pool without sys admin
>     acknowledgement
>      >> ReturnToService=0
>      >>
>      >> ## Basic scheduling
>      >> #SelectType=select/cons_res
>      >> SelectType=select/cons_tres
>      >> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>      >> SchedulerType=sched/backfill
>      >>
>      >> TaskPlugin=task/cgroup
>      >> ProctrackType=proctrack/cgroup
>      >>
>      >> ## Nodes list
>      >> ## use native GPUs
>      >> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16
>     ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8
>     Feature=ht,gpu
>      >>
>      >> ## Partitions list
>      >> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8
>     MaxTime=INFINITE State=UP Nodes=nodeGPU01  Default=YES
>      >> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128
>     MaxTime=INFINITE State=UP Nodes=nodeGPU01
>      >> --
>      >> Cristóbal A. Navarro
>      >
>      >
>      >
>      > --
>      > Cristóbal A. Navarro
> 
> 
> 
> -- 
> Cristóbal A. Navarro


-------------------------------------------------------------------
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59  |  ETF D 104  |  Sternwartstrasse 7  | 8092 Zurich
-------------------------------------------------------------------
