[slurm-users] NVML not found when Slurm was configured.

Michael Lewis mike.lewis at queensu.ca
Fri Nov 11 23:12:58 UTC 2022


Yes, sorry Rob. I meant that I did build and install with --with-nvml, and it didn't find NVML.  I then tried again, specifying the location.  Unfortunately, at that point users needed to run a few jobs and I wasn't able to investigate further.  I will get back to it when they've finished.

Mike
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Groner, Rob <rug262 at psu.edu>
Sent: Friday, November 11, 2022 5:07 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] NVML not found when Slurm was configured.

I'm not sure what you mean by "didn't work out for me".  The error indicates Slurm wasn't correctly configured for NVML when it was built, so the first step would be to get the Slurm source, run configure --with-nvml, and see what it says.
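One way to "see what it says" without scrolling through the whole configure run is to pull out only the lines that mention NVML. The sketch below is a plain text filter, not anything Slurm ships; the function name nvml_lines is made up for illustration, and it assumes the log being read is config.log, the file that autoconf-generated configure scripts write in the build directory.

```python
from pathlib import Path


def nvml_lines(log_path):
    """Return (line_number, text) pairs from a configure log that mention NVML."""
    hits = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with Path(log_path).open(errors="replace") as log:
        for number, line in enumerate(log, start=1):
            # Case-insensitive match catches nvml.h, libnvidia-ml and --with-nvml alike.
            if "nvml" in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling nvml_lines("config.log") from the directory where configure was run returns the matching lines, which should show whether the header and library checks passed.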

There's a chance the error indicates Slurm can't find the libnvidia-ml.so library on the system it is currently running on, so you might try installing the package that provides it and see if Slurm finds it.  But I'm pretty sure it means Slurm needs to be configured and built from source.
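To rule out the less likely case first, it is possible to ask the dynamic loader whether it can see libnvidia-ml at all. This is a rough check using Python's standard ctypes module; the function name nvml_runtime_status is hypothetical, and a successful load only says the library is present on this machine, not that Slurm was built against it.

```python
import ctypes
import ctypes.util


def nvml_runtime_status():
    """Report whether the dynamic loader can find and load the NVML shared library."""
    # find_library looks in the usual loader locations for libnvidia-ml.
    name = ctypes.util.find_library("nvidia-ml")
    if name is None:
        return "not found"
    try:
        ctypes.CDLL(name)
    except OSError as exc:
        return f"found {name} but failed to load: {exc}"
    return f"loaded {name}"
```

If this reports the library as loaded and slurmd still prints the same error, that points back at the build-time configuration.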

Rob


________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Michael Lewis <mike.lewis at queensu.ca>
Sent: Friday, November 11, 2022 3:34 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] NVML not found when Slurm was configured.


Unfortunately this didn’t work out for me or I’m simply doing it wrong.  When the current users hop off the system I’ll do some more troubleshooting.  Any other insight or tips to steer me in the right direction are greatly appreciated.



Mike



From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Michael Lewis <mike.lewis at queensu.ca>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Friday, November 11, 2022 at 10:01 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] NVML not found when Slurm was configured.



Thanks Rob!  No, I just grabbed it through apt.  I’ll try that now.



Mike



From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of "Groner, Rob" <rug262 at psu.edu>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Friday, November 11, 2022 at 9:32 AM
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] NVML not found when Slurm was configured.



Hi Mike,



I can't tell whether you're compiling Slurm yourself.  You will have to if you want the NVML functionality.



On RedHat 8, I had to install cuda-nvml-devel-11-7, so find the equivalent package for Ubuntu.  Basically, you want whatever package includes nvml.h and libnvidia-ml.so.  Then modify your configure statement when building Slurm to add "--with-nvml".  Check the configure output, because it may still not find it (it didn't on our system because we installed the devel package to a non-standard location).  If that's the case, just change it to --with-nvml=<path to nvml lib dir>.  Then it should all work.
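If the devel package landed somewhere non-standard, a quick search for the two files named above helps when filling in the path. The following is only a sketch: the function name find_nvml_files and the default search roots /usr and /opt are assumptions, not anything taken from Slurm or the NVIDIA packages.

```python
import os


def find_nvml_files(roots=("/usr", "/opt")):
    """Map each NVML file name to the sorted directories that contain it."""
    wanted = ("nvml.h", "libnvidia-ml.so")
    found = {name: set() for name in wanted}
    for root in roots:
        # os.walk silently skips roots that do not exist or cannot be read.
        for directory, _subdirs, files in os.walk(root):
            for name in wanted:
                if name in files:
                    found[name].add(directory)
    return {name: sorted(dirs) for name, dirs in found.items()}
```

An empty list for either file suggests the devel package is missing or installed outside the roots that were searched.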



I'll note that once it's all set up, your gres.conf becomes just "<nodenames> AutoDetect=nvml".



G'luck.



rob



________________________________

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Michael Lewis <mike.lewis at queensu.ca>
Sent: Friday, November 11, 2022 9:12 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [slurm-users] NVML not found when Slurm was configured.




Hello Everyone,



I'm new here and very new to Slurm, and I'm hoping someone can shed some light on this for me.  I’m in the process of setting up a single-node Slurm environment with an NVIDIA A100.  When trying to start slurmd, I keep getting the error "We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured."  When I remove GresTypes=gpu from slurm.conf, slurmd starts up fine and can queue up and run jobs.  The CUDA toolkit is installed, along with the NVIDIA Management Library (NVML).  I went as far as removing Slurm and reinstalling it to see if it would pick it up.  No go.



OS: Ubuntu 20.04
slurm.conf: GresTypes=gpu is added
gres.conf: AutoDetect=nvml Name=gpu Type=a100 File=/dev/nvidia0 COREs=0,1



I’ve searched around and see that many others have run into this but I haven’t found a fix yet.  Any help would be greatly appreciated.



Thanks,



Mike



