[slurm-users] Slurm not detecting GPU after swapping out GPU

Dean Schulze dean.w.schulze at gmail.com
Mon Apr 27 16:47:59 UTC 2020


I replaced an NVIDIA V100 with a T4. Now Slurm thinks there is no GPU
present:

$ sudo scontrol show node fabricnode2
NodeName=fabricnode2 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:nvidia:1
   NodeAddr=fabricnode2 NodeHostName=fabricnode2 Version=19.05.4
   OS=Linux 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020
   RealMemory=7802 AllocMem=0 FreeMem=6828 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-04-27T10:24:18 SlurmdStartTime=2020-04-27T10:39:53
   CfgTRES=cpu=12,mem=7802M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=gres/gpu count reported lower than configured (0 < 1) [root@2020-04-27T10:34:25]

The GPU is there and I can execute CUDA binaries on it. nvidia-smi also
shows it as present. I've also rebooted the node, restarted slurmctld, and run
scontrol reconfigure.
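
In case the GRES configuration matters: the node is declared with
Gres=gpu:nvidia:1 (as the scontrol output above shows), and gres.conf on the
node has a single GPU line roughly like the one below. I'm quoting from
memory, so the exact device path is a guess on my part:

# slurm.conf node definition (relevant part only)
NodeName=fabricnode2 ... Gres=gpu:nvidia:1

# gres.conf on fabricnode2 (device path from memory, may not be exact)
Name=gpu Type=nvidia File=/dev/nvidia0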

How does Slurm determine whether a GPU is present? It's getting it wrong here.
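
If it helps with diagnosis, the next things I'm planning to try are stopping
slurmd on the node, running it in the foreground to see which GRES it detects
at startup (I believe the gres lines show up in the verbose output), and then
clearing the drain state. Roughly:

$ sudo systemctl stop slurmd
$ sudo slurmd -D -vvv    # check the startup output for gres/gpu lines
$ sudo scontrol update NodeName=fabricnode2 State=RESUME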