[slurm-users] Slurm not detecting gpu after swapping out gpu

Sarlo, Jeffrey S JSarlo at Central.UH.EDU
Mon Apr 27 18:02:52 UTC 2020


How do you have fabricnode2 defined in your gres.conf and slurm.conf files?  Since the type of GPU changed, the definition for it probably needs to be updated as well.
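For example, if the node was defined with an explicit GPU type, both files have to agree with the new card. A minimal sketch (the Type= names and the /dev path below are guesses, not taken from your setup; check what nvidia-smi reports):

```
# gres.conf on fabricnode2 -- Type and File are illustrative assumptions
Name=gpu Type=t4 File=/dev/nvidia0

# slurm.conf -- the node's Gres= string must match gres.conf
NodeName=fabricnode2 Gres=gpu:t4:1 CPUs=12 RealMemory=7802 ...
```

You can run `slurmd -G` on the node to print the GRES that slurmd actually detects there. Also note the node is in DRAIN, so after fixing the config and restarting slurmd you will likely need to resume it with `scontrol update NodeName=fabricnode2 State=RESUME`.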

Jeff

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Dean Schulze <dean.w.schulze at gmail.com>
Sent: Monday, April 27, 2020 11:47 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Slurm not detecting gpu after swapping out gpu

I replaced an NVIDIA V100 with a T4.  Now Slurm thinks there is no GPU present:

$ sudo scontrol show node fabricnode2
NodeName=fabricnode2 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:nvidia:1
   NodeAddr=fabricnode2 NodeHostName=fabricnode2 Version=19.05.4
   OS=Linux 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020
   RealMemory=7802 AllocMem=0 FreeMem=6828 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-04-27T10:24:18 SlurmdStartTime=2020-04-27T10:39:53
   CfgTRES=cpu=12,mem=7802M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=gres/gpu count reported lower than configured (0 < 1) [root at 2020-04-27T10:34:25]

The GPU is there and I can execute CUDA binaries on it; nvidia-smi also shows it present.  I've also rebooted the node, restarted slurmctld, and run scontrol reconfigure.

How does Slurm determine whether a GPU is present?  Whatever it's doing, it's getting it wrong here.