[slurm-users] AutoDetect=nvml throwing an error message

Cristóbal Navarro cristobal.navarro.g at gmail.com
Thu Apr 15 17:46:56 UTC 2021


Hi Michael,
Thanks. Indeed, I don't have it, so Slurm must not have detected NVML when it
was built.
I double-checked, and NVML is installed (libnvidia-ml-dev on Ubuntu).
Here is some output, including the relevant paths for NVML.
Is it possible to tell the Slurm build to check these paths for NVML?
best

*NVML PKG CHECK*
➜  ~ sudo apt search nvml
Sorting... Done
Full Text Search... Done
cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64
  NVML native dev links, headers

cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64
  NVML native dev links, headers

cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64
  NVML native dev links, headers


*libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]
  NVIDIA Management Library (NVML) development files*

python3-pynvml/focal 7.352.0-3 amd64
  Python3 bindings to the NVIDIA Management Library



*NVML Shared library location*
➜  ~ find /usr/lib | grep libnvidia-ml
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so



*NVML Header*
➜  ~ find /usr | grep nvml
/usr/include/nvml.h




*SLURM LIBS*
➜  ~ ls /usr/lib64/slurm/
accounting_storage_mysql.so*        core_spec_none.so*             job_submit_pbs.so*                proctrack_pgid.so*
accounting_storage_none.so*         cred_munge.so*                 job_submit_require_timelimit.so*  route_default.so*
accounting_storage_slurmdbd.so*     cred_none.so*                  job_submit_throttle.so*           route_topology.so*
acct_gather_energy_ibmaem.so*       ext_sensors_none.so*           launch_slurm.so*                  sched_backfill.so*
acct_gather_energy_ipmi.so*         gpu_generic.so*                mcs_account.so*                   sched_builtin.so*
acct_gather_energy_none.so*         gres_gpu.so*                   mcs_group.so*                     sched_hold.so*
acct_gather_energy_pm_counters.so*  gres_mic.so*                   mcs_none.so*                      select_cons_res.so*
acct_gather_energy_rapl.so*         gres_mps.so*                   mcs_user.so*                      select_cons_tres.so*
acct_gather_energy_xcc.so*          gres_nic.so*                   mpi_none.so*                      select_linear.so*
acct_gather_filesystem_lustre.so*   jobacct_gather_cgroup.so*      mpi_pmi2.so*                      site_factor_none.so*
acct_gather_filesystem_none.so*     jobacct_gather_linux.so*       mpi_pmix.so@                      slurmctld_nonstop.so*
acct_gather_interconnect_none.so*   jobacct_gather_none.so*        mpi_pmix_v2.so*                   src/
acct_gather_interconnect_ofed.so*   jobcomp_elasticsearch.so*      node_features_knl_generic.so*     switch_none.so*
acct_gather_profile_hdf5.so*        jobcomp_filetxt.so*            power_none.so*                    task_affinity.so*
acct_gather_profile_influxdb.so*    jobcomp_lua.so*                preempt_none.so*                  task_cgroup.so*
acct_gather_profile_none.so*        jobcomp_mysql.so*              preempt_partition_prio.so*        task_none.so*
auth_munge.so*                      jobcomp_none.so*               preempt_qos.so*                   topology_3d_torus.so*
burst_buffer_generic.so*            jobcomp_script.so*             prep_script.so*                   topology_hypercube.so*
cli_filter_lua.so*                  job_container_cncu.so*         priority_basic.so*                topology_none.so*
cli_filter_none.so*                 job_container_none.so*         priority_multifactor.so*          topology_tree.so*
cli_filter_syslog.so*               job_submit_all_partitions.so*  proctrack_cgroup.so*
cli_filter_user_defaults.so*        job_submit_lua.so*             proctrack_linuxproc.so*
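
For reference, a rebuild along these lines might pick up NVML. This is only a
sketch under the assumption of a standard autotools source build; the
--with-nvml option and the exact paths below are assumptions on my part, not
something I have verified yet:

# Sketch, untested: rebuild Slurm so configure can find NVML.
# Adjust --prefix/--libdir so the result matches the existing install
# (our plugins live under /usr/lib64/slurm).
cd /path/to/slurm-source
./configure --prefix=/usr --sysconfdir=/etc/slurm --with-nvml=/usr
grep -i nvml config.log                # did configure actually find nvml.h / libnvidia-ml?
make -j"$(nproc)" && sudo make install
ls /usr/lib64/slurm/ | grep gpu_nvml   # gpu_nvml.so should now be present

After that, restarting slurmd with AutoDetect=nvml in gres.conf should be worth trying again.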

On Thu, Apr 15, 2021 at 9:02 AM Michael Di Domenico <mdidomenico4 at gmail.com>
wrote:

> The error message sounds like, when you built the Slurm source, it
> wasn't able to find the NVML devel packages. If you look where you
> installed Slurm, in lib/slurm you should have a gpu_nvml.so. Do you?
>
> On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
> <cristobal.navarro.g at gmail.com> wrote:
> >
> > typing error, should be --> **located at /usr/include/nvml.h**
> >
> > On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <
> cristobal.navarro.g at gmail.com> wrote:
> >>
> >> Hi community,
> >> I have set up the configuration files as described in the
> >> documentation, but slurmd on the GPU compute node fails with the error
> >> shown in the log below.
> >> After reading the Slurm documentation, it is not entirely clear to me
> >> how to properly set up GPU autodetection in gres.conf, since it does not
> >> say whether the NVML detection should be automatic or not.
> >> I have also read the top Google results, including
> >> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html
> >> but that was a problem of a CUDA installation being overwritten (not my
> >> case).
> >> This is a DGX A100 node that comes with the Nvidia driver installed, and
> >> nvml is located at /etc/include/nvml.h; I am not sure if there is a
> >> libnvml.so or similar as well.
> >> How can I tell Slurm to look at those paths? Any ideas or experience
> >> sharing are welcome.
> >> best
> >>
> >>
> >> slurmd.log (GPU node)
> >> [2021-04-14T17:31:42.302] got shutdown request
> >> [2021-04-14T17:31:42.302] all threads complete
> >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open
> '(null)/tasks' for reading : No such file or directory
> >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids
> of '(null)'
> >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open
> '(null)/tasks' for reading : No such file or directory
> >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids
> of '(null)'
> >> [2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
> >> [2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini: unloading
> GPU Generic plugin
> >> [2021-04-14T17:31:42.304] select/cons_tres: common_fini:
> select/cons_tres shutting down ...
> >> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
> >> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature
> plugin unloaded
> >> [2021-04-14T17:31:42.304] Slurmd shutdown completing
> >> [2021-04-14T17:31:42.321] debug:  Log file re-opened
> >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
> >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
> >> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
> >> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
> >> [2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8
> CoresPerSocket:16 ThreadsPerCore:2
> >> [2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file
> /etc/slurm/cgroup.conf
> >> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
> >> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file
> (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
> >> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
> >> [2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8
> CoresPerSocket:16 ThreadsPerCore:2
> >> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
> >> [2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
> >> [2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml
> functionality, but we weren't able to find that lib when Slurm was
> configured.
> >>
> >>
> >>
> >> gres.conf (just AutoDetect=nvml)
> >> ➜  ~ cat /etc/slurm/gres.conf
> >> # GRES configuration for native GPUS
> >> # DGX A100 8x Nvidia A100
> >> # not working, slurm cannot find nvml
> >> AutoDetect=nvml
> >> #Name=gpu File=/dev/nvidia[0-7]
> >> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
> >> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
> >> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
> >> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
> >> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
> >> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
> >> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
> >> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
> >> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
> >>
> >>
> >> slurm.conf
> >> GresTypes=gpu
> >> AccountingStorageTRES=gres/gpu
> >> DebugFlags=CPU_Bind,gres
> >>
> >> ## We don't want a node to go back in pool without sys admin
> acknowledgement
> >> ReturnToService=0
> >>
> >> ## Basic scheduling
> >> #SelectType=select/cons_res
> >> SelectType=select/cons_tres
> >> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> >> SchedulerType=sched/backfill
> >>
> >> TaskPlugin=task/cgroup
> >> ProctrackType=proctrack/cgroup
> >>
> >> ## Nodes list
> >> ## use native GPUs
> >> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2
> RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
> >>
> >> ## Partitions list
> >> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE
> State=UP Nodes=nodeGPU01  Default=YES
> >> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128
> MaxTime=INFINITE State=UP Nodes=nodeGPU01
> >> --
> >> Cristóbal A. Navarro
> >
> >
> >
> > --
> > Cristóbal A. Navarro
>
>

-- 
Cristóbal A. Navarro