[slurm-users] AutoDetect=nvml throwing an error message
Stephan Roth
stephan.roth at ee.ethz.ch
Fri Apr 16 10:18:19 UTC 2021
Hi Cristóbal
Under Debian Stretch/Buster I had to set
LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current for configure to find
the NVML shared library.
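Roughly like this, from the unpacked Slurm source tree (the prefix and
sysconfdir below are only placeholders, adjust them to your installation):

   export LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current
   ./configure --prefix=/usr --sysconfdir=/etc/slurm
   make -j && sudo make install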
Best,
Stephan
On 15.04.21 19:46, Cristóbal Navarro wrote:
> Hi Michael,
> Thanks. Indeed, I don't have it; Slurm must not have detected it.
> I double-checked, and NVML is installed (libnvidia-ml-dev on Ubuntu).
> Here is some output, including the relevant paths for NVML.
> Is it possible to tell the Slurm compilation to check these paths for NVML?
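> (One way to confirm what the build actually saw, assuming the original
> build tree is still around, would be to grep its config.log for the
> NVML checks, e.g.:
>
>    grep -i nvml config.log
>
> If configure never found nvml.h or libnvidia-ml, the NVML GPU plugin
> would not have been built.)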
> best
>
> *NVML PKG CHECK*
> ➜ ~ sudo apt search nvml
> Sorting... Done
> Full Text Search... Done
> cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64
> NVML native dev links, headers
>
> cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64
> NVML native dev links, headers
>
> cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64
> NVML native dev links, headers
>
> *libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]
> NVIDIA Management Library (NVML) development files*
> python3-pynvml/focal 7.352.0-3 amd64
> Python3 bindings to the NVIDIA Management Library
>
>
>
> *NVML Shared library location*
> ➜ ~ find /usr/lib | grep libnvidia-ml
> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04
> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
>
>
>
> *NVML Header*
> ➜ ~ find /usr | grep nvml
> /usr/include/nvml.h
>
>
>
>
> *SLURM LIBS*
> ➜ ~ ls /usr/lib64/slurm/
> accounting_storage_mysql.so*  core_spec_none.so*  job_submit_pbs.so*  proctrack_pgid.so*
> accounting_storage_none.so*  cred_munge.so*  job_submit_require_timelimit.so*  route_default.so*
> accounting_storage_slurmdbd.so*  cred_none.so*  job_submit_throttle.so*  route_topology.so*
> acct_gather_energy_ibmaem.so*  ext_sensors_none.so*  launch_slurm.so*  sched_backfill.so*
> acct_gather_energy_ipmi.so*  gpu_generic.so*  mcs_account.so*  sched_builtin.so*
> acct_gather_energy_none.so*  gres_gpu.so*  mcs_group.so*  sched_hold.so*
> acct_gather_energy_pm_counters.so*  gres_mic.so*  mcs_none.so*  select_cons_res.so*
> acct_gather_energy_rapl.so*  gres_mps.so*  mcs_user.so*  select_cons_tres.so*
> acct_gather_energy_xcc.so*  gres_nic.so*  mpi_none.so*  select_linear.so*
> acct_gather_filesystem_lustre.so*  jobacct_gather_cgroup.so*  mpi_pmi2.so*  site_factor_none.so*
> acct_gather_filesystem_none.so*  jobacct_gather_linux.so*  mpi_pmix.so@  slurmctld_nonstop.so*
> acct_gather_interconnect_none.so*  jobacct_gather_none.so*  mpi_pmix_v2.so*  src/
> acct_gather_interconnect_ofed.so*  jobcomp_elasticsearch.so*  node_features_knl_generic.so*  switch_none.so*
> acct_gather_profile_hdf5.so*  jobcomp_filetxt.so*  power_none.so*  task_affinity.so*
> acct_gather_profile_influxdb.so*  jobcomp_lua.so*  preempt_none.so*  task_cgroup.so*
> acct_gather_profile_none.so*  jobcomp_mysql.so*  preempt_partition_prio.so*  task_none.so*
> auth_munge.so*  jobcomp_none.so*  preempt_qos.so*  topology_3d_torus.so*
> burst_buffer_generic.so*  jobcomp_script.so*  prep_script.so*  topology_hypercube.so*
> cli_filter_lua.so*  job_container_cncu.so*  priority_basic.so*  topology_none.so*
> cli_filter_none.so*  job_container_none.so*  priority_multifactor.so*  topology_tree.so*
> cli_filter_syslog.so*  job_submit_all_partitions.so*  proctrack_cgroup.so*
> cli_filter_user_defaults.so*  job_submit_lua.so*  proctrack_linuxproc.so*
>
> On Thu, Apr 15, 2021 at 9:02 AM Michael Di Domenico
> <mdidomenico4 at gmail.com> wrote:
>
> The error message sounds like the Slurm build wasn't able to find the
> NVML devel packages. If you look where you installed Slurm, you should
> have a gpu_nvml.so in lib/slurm. Do you?
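> (Something along these lines, assuming the plugin directory is
> /usr/lib64/slurm as in the listing above:
>
>    ls /usr/lib64/slurm/gpu_*.so
>
> If only gpu_generic.so comes back, the NVML plugin was not built.)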
>
> On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
> <cristobal.navarro.g at gmail.com> wrote:
> >
> > Typing error; it should be **located at /usr/include/nvml.h**
> >
> > On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro
> > <cristobal.navarro.g at gmail.com> wrote:
> >>
> >> Hi community,
> >> I have set up the configuration files as described in the
> >> documentation, but slurmd on the GPU compute node fails with the
> >> error shown in the log below.
> >> After reading the Slurm documentation, it is not entirely clear to me
> >> how to properly set up GPU autodetection in the gres.conf file, as it
> >> does not say whether NVML detection is supposed to be automatic or not.
> >> I have also read the top Google results, including
> >> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html
> >> but that turned out to be an overwritten CUDA installation (not my case).
> >> This is a DGX A100 node that comes with the NVIDIA driver installed,
> >> and nvml is located at /etc/include/nvml.h; I am not sure whether
> >> there is a libnvml.so or similar as well.
> >> How can I tell Slurm to look at those paths? Any ideas or shared
> >> experience are welcome.
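> >> (A quick check for the driver's NVML library, which is usually named
> >> libnvidia-ml.so rather than libnvml.so, could be something like:
> >>
> >>    ldconfig -p | grep libnvidia-ml
> >>    find /usr /etc -name nvml.h 2>/dev/null
> >>
> >> Both the header and that shared library have to be visible when Slurm
> >> is configured for the NVML autodetection support to be built in.)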
> >> best
> >>
> >>
> >> slurmd.log (GPU node)
> >> [2021-04-14T17:31:42.302] got shutdown request
> >> [2021-04-14T17:31:42.302] all threads complete
> >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to
> open '(null)/tasks' for reading : No such file or directory
> >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to
> get pids of '(null)'
> >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to
> open '(null)/tasks' for reading : No such file or directory
> >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to
> get pids of '(null)'
> >> [2021-04-14T17:31:42.304] debug: gres/gpu: fini: unloading
> >> [2021-04-14T17:31:42.304] debug: gpu/generic: fini: fini:
> unloading GPU Generic plugin
> >> [2021-04-14T17:31:42.304] select/cons_tres: common_fini:
> select/cons_tres shutting down ...
> >> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so:
> slurmd_exit = 0
> >> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential
> signature plugin unloaded
> >> [2021-04-14T17:31:42.304] Slurmd shutdown completing
> >> [2021-04-14T17:31:42.321] debug: Log file re-opened
> >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
> >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
> >> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
> >> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
> >> [2021-04-14T17:31:42.446] debug: CPUs:256 Boards:1 Sockets:8
> CoresPerSocket:16 ThreadsPerCore:2
> >> [2021-04-14T17:31:42.446] debug: Reading cgroup.conf file
> /etc/slurm/cgroup.conf
> >> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
> >> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml
> file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
> >> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
> >> [2021-04-14T17:31:42.448] debug: CPUs:256 Boards:1 Sockets:8
> CoresPerSocket:16 ThreadsPerCore:2
> >> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
> >> [2021-04-14T17:31:42.449] debug: gres/gpu: init: loaded
> >> [2021-04-14T17:31:42.449] fatal: We were configured to
> autodetect nvml functionality, but we weren't able to find that lib
> when Slurm was configured.
> >>
> >>
> >>
> >> gres.conf (just AutoDetect=nvml)
> >> ➜ ~ cat /etc/slurm/gres.conf
> >> # GRES configuration for native GPUS
> >> # DGX A100 8x Nvidia A100
> >> # not working, slurm cannot find nvml
> >> AutoDetect=nvml
> >> #Name=gpu File=/dev/nvidia[0-7]
> >> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
> >> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
> >> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
> >> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
> >> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
> >> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
> >> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
> >> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
> >> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
> >>
> >>
> >> slurm.conf
> >> GresTypes=gpu
> >> AccountingStorageTRES=gres/gpu
> >> DebugFlags=CPU_Bind,gres
> >>
> >> ## We don't want a node to go back in pool without sys admin
> acknowledgement
> >> ReturnToService=0
> >>
> >> ## Basic scheduling
> >> #SelectType=select/cons_res
> >> SelectType=select/cons_tres
> >> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> >> SchedulerType=sched/backfill
> >>
> >> TaskPlugin=task/cgroup
> >> ProctrackType=proctrack/cgroup
> >>
> >> ## Nodes list
> >> ## use native GPUs
> >> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16
> ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8
> Feature=ht,gpu
> >>
> >> ## Partitions list
> >> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8
> MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
> >> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128
> MaxTime=INFINITE State=UP Nodes=nodeGPU01
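> >> (Once slurmd starts cleanly with AutoDetect=nvml, a simple end-to-end
> >> check could be a one-GPU job against the partition above, e.g.:
> >>
> >>    srun -p gpu --gres=gpu:1 nvidia-smi -L
> >>
> >> With device confinement enabled in cgroup.conf this should list only
> >> the allocated A100.)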
> >> --
> >> Cristóbal A. Navarro
> >
> >
> >
> > --
> > Cristóbal A. Navarro
>
>
>
> --
> Cristóbal A. Navarro
-------------------------------------------------------------------
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich
-------------------------------------------------------------------