<div dir="ltr"><div>Hi Michael,</div><div>Thanks, indeed I don't have it. Slurm must not have detected it. <br></div><div>I double-checked and NVML is installed (libnvidia-ml-dev on Ubuntu).</div><div>Here is some output, including the relevant paths for NVML.</div><div>Is it possible to tell the Slurm build to check these paths for NVML?</div><div>Best,<br></div><div><br></div><div><b>NVML PKG CHECK</b><br></div><div><span style="font-family:monospace">➜ ~ sudo apt search nvml<br>Sorting... Done<br>Full Text Search... Done<br>cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64<br> NVML native dev links, headers<br><br>cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64<br> NVML native dev links, headers<br><br>cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64<br> NVML native dev links, headers<br><span style="color:rgb(111,168,220)"><span style="background-color:rgb(255,255,255)"><br><span style="color:rgb(204,0,0)"><b>libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]<br> NVIDIA Management Library (NVML) development files</b></span></span></span></span></div><div><span style="color:rgb(0,0,0)"><span style="font-family:monospace"><span style="background-color:rgb(255,255,255)">python3-pynvml/focal 7.352.0-3 amd64<br> Python3 bindings to the NVIDIA Management Library</span></span></span></div><div><br></div><div><br></div><div><br></div><div><b>NVML Shared library location</b><br></div><div><span style="font-family:monospace">➜ ~ find /usr/lib | grep libnvidia-ml<br>/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1<br>/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04<br>/usr/lib/x86_64-linux-gnu/libnvidia-ml.so</span></div><div><br></div><div><br></div><div><br></div><div><b>NVML Header</b></div><div><span style="font-family:monospace">➜ ~ find /usr | grep nvml<br>/usr/include/nvml.h</span><br></div><div><br></div><div><br></div><div><br><span style="font-family:monospace"></span></div><div><span style="font-family:monospace"></span></div><div><span 
style="font-family:monospace"><br></span></div><div><span style="font-family:monospace"><span style="font-family:arial,sans-serif"><b>SLURM LIBS</b></span><br></span></div><div><span style="font-family:monospace">➜ ~ ls /usr/lib64/slurm/<br>accounting_storage_mysql.so* core_spec_none.so* job_submit_pbs.so* proctrack_pgid.so* <br>accounting_storage_none.so* cred_munge.so* job_submit_require_timelimit.so* route_default.so* <br>accounting_storage_slurmdbd.so* cred_none.so* job_submit_throttle.so* route_topology.so* <br>acct_gather_energy_ibmaem.so* ext_sensors_none.so* launch_slurm.so* sched_backfill.so* <br>acct_gather_energy_ipmi.so* gpu_generic.so* mcs_account.so* sched_builtin.so* <br>acct_gather_energy_none.so* gres_gpu.so* mcs_group.so* sched_hold.so* <br>acct_gather_energy_pm_counters.so* gres_mic.so* mcs_none.so* select_cons_res.so* <br>acct_gather_energy_rapl.so* gres_mps.so* mcs_user.so* select_cons_tres.so* <br>acct_gather_energy_xcc.so* gres_nic.so* mpi_none.so* select_linear.so* <br>acct_gather_filesystem_lustre.so* jobacct_gather_cgroup.so* mpi_pmi2.so* site_factor_none.so* <br>acct_gather_filesystem_none.so* jobacct_gather_linux.so* mpi_pmix.so@ slurmctld_nonstop.so* <br>acct_gather_interconnect_none.so* jobacct_gather_none.so* mpi_pmix_v2.so* src/ <br>acct_gather_interconnect_ofed.so* jobcomp_elasticsearch.so* node_features_knl_generic.so* switch_none.so* <br>acct_gather_profile_hdf5.so* jobcomp_filetxt.so* power_none.so* task_affinity.so* <br>acct_gather_profile_influxdb.so* jobcomp_lua.so* preempt_none.so* task_cgroup.so* <br>acct_gather_profile_none.so* jobcomp_mysql.so* preempt_partition_prio.so* task_none.so* <br>auth_munge.so* jobcomp_none.so* preempt_qos.so* topology_3d_torus.so* <br>burst_buffer_generic.so* jobcomp_script.so* prep_script.so* topology_hypercube.so* <br>cli_filter_lua.so* job_container_cncu.so* priority_basic.so* topology_none.so* <br>cli_filter_none.so* job_container_none.so* priority_multifactor.so* topology_tree.so* 
<br>cli_filter_syslog.so* job_submit_all_partitions.so* proctrack_cgroup.so* <br>cli_filter_user_defaults.so* job_submit_lua.so* proctrack_linuxproc.so* </span><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Apr 15, 2021 at 9:02 AM Michael Di Domenico <<a href="mailto:mdidomenico4@gmail.com">mdidomenico4@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">the error message sounds like when you built the slurm source it<br>
wasn't able to find the nvml devel packages. if you look where you<br>
installed slurm, in lib/slurm you should have a gpu_nvml.so. do you?<br>
<br>
On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro<br>
<<a href="mailto:cristobal.navarro.g@gmail.com" target="_blank">cristobal.navarro.g@gmail.com</a>> wrote:<br>
><br>
> typing error, should be --> **located at /usr/include/nvml.h**<br>
><br>
> On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <<a href="mailto:cristobal.navarro.g@gmail.com" target="_blank">cristobal.navarro.g@gmail.com</a>> wrote:<br>
>><br>
>> Hi community,<br>
>> I have set up the configuration files as described in the documentation, but slurmd on the GPU compute node fails with the error shown in the log below.<br>
>> After reading the Slurm documentation, it is not entirely clear to me how to properly set up GPU autodetection in the gres.conf file, as it does not say whether NVML detection should be automatic or not.<br>
>> I have also read the top Google results, including <a href="https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html" rel="noreferrer" target="_blank">https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html</a>, but that was a case of an overwritten CUDA installation (not my case).<br>
>> This is a DGX A100 node that comes with the NVIDIA driver installed, and nvml is located at /etc/include/nvml.h; I am not sure if there is a libnvml.so or similar as well.<br>
>> How can I tell Slurm to look at those paths? Any ideas or shared experience are welcome.<br>
>> best<br>
>><br>
>><br>
>> slurmd.log (GPU node)<br>
>> [2021-04-14T17:31:42.302] got shutdown request<br>
>> [2021-04-14T17:31:42.302] all threads complete<br>
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory<br>
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'<br>
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory<br>
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'<br>
>> [2021-04-14T17:31:42.304] debug: gres/gpu: fini: unloading<br>
>> [2021-04-14T17:31:42.304] debug: gpu/generic: fini: fini: unloading GPU Generic plugin<br>
>> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres shutting down ...<br>
>> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0<br>
>> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature plugin unloaded<br>
>> [2021-04-14T17:31:42.304] Slurmd shutdown completing<br>
>> [2021-04-14T17:31:42.321] debug: Log file re-opened<br>
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init<br>
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load<br>
>> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml<br>
>> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket<br>
>> [2021-04-14T17:31:42.446] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2<br>
>> [2021-04-14T17:31:42.446] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf<br>
>> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init<br>
>> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found<br>
>> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket<br>
>> [2021-04-14T17:31:42.448] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2<br>
>> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)<br>
>> [2021-04-14T17:31:42.449] debug: gres/gpu: init: loaded<br>
>> [2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.<br>
>><br>
>><br>
>><br>
>> gres.conf (just AutoDetect=nvml)<br>
>> ➜ ~ cat /etc/slurm/gres.conf<br>
>> # GRES configuration for native GPUS<br>
>> # DGX A100 8x Nvidia A100<br>
>> # not working, slurm cannot find nvml<br>
>> AutoDetect=nvml<br>
>> #Name=gpu File=/dev/nvidia[0-7]<br>
>> #Name=gpu Type=A100 File=/dev/nvidia[0-7]<br>
>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7<br>
>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15<br>
>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23<br>
>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31<br>
>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39<br>
>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47<br>
>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55<br>
>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63<br>
>><br>
>><br>
>> slurm.conf<br>
>> GresTypes=gpu<br>
>> AccountingStorageTRES=gres/gpu<br>
>> DebugFlags=CPU_Bind,gres<br>
>><br>
>> ## We don't want a node to go back in pool without sys admin acknowledgement<br>
>> ReturnToService=0<br>
>><br>
>> ## Basic scheduling<br>
>> #SelectType=select/cons_res<br>
>> SelectType=select/cons_tres<br>
>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE<br>
>> SchedulerType=sched/backfill<br>
>><br>
>> TaskPlugin=task/cgroup<br>
>> ProctrackType=proctrack/cgroup<br>
>><br>
>> ## Nodes list<br>
>> ## use native GPUs<br>
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu<br>
>><br>
>> ## Partitions list<br>
>> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES<br>
>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01<br>
>> --<br>
>> Cristóbal A. Navarro<br>
><br>
><br>
><br>
> --<br>
> Cristóbal A. Navarro<br>
<br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Cristóbal A. Navarro<br></div></div></div></div></div></div>