[slurm-users] Slurm not detecting gpu after swapping out gpu
Dean Schulze
dean.w.schulze at gmail.com
Mon Apr 27 16:47:59 UTC 2020
I replaced an Nvidia V100 with a T4. Now Slurm thinks there is no GPU present:
$ sudo scontrol show node fabricnode2
NodeName=fabricnode2 Arch=x86_64 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=0.02
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:nvidia:1
NodeAddr=fabricnode2 NodeHostName=fabricnode2 Version=19.05.4
OS=Linux 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020
RealMemory=7802 AllocMem=0 FreeMem=6828 Sockets=1 Boards=1
State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=debug
BootTime=2020-04-27T10:24:18 SlurmdStartTime=2020-04-27T10:39:53
CfgTRES=cpu=12,mem=7802M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=gres/gpu count reported lower than configured (0 < 1)
[root@2020-04-27T10:34:25]
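If the answer lies in the configuration, I assume slurmd compares the devices it detects on the node against entries like the following in slurm.conf and gres.conf. This is only a sketch of what I believe the relevant lines look like; the File= device path in particular is an assumption on my part, not copied from my actual files:

# slurm.conf (shared by the controller and the compute nodes)
GresTypes=gpu
NodeName=fabricnode2 Gres=gpu:nvidia:1 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=7802

# gres.conf on fabricnode2 (device path assumed)
NodeName=fabricnode2 Name=gpu Type=nvidia File=/dev/nvidia0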
The GPU is there and I can execute CUDA binaries on it; nvidia-smi also shows it as present. I've also rebooted the node, restarted slurmctld, and run reconfigure.
How does Slurm determine whether a GPU is present? Whatever check it uses is getting it wrong here.
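In case it helps, these are the checks I plan to run next on the compute node. The slurmd log path depends on SlurmdLogFile, and the -G option may not exist in every slurmd build, so treat both as assumptions:

# Ask slurmd which GRES devices it actually detects on this node
$ sudo slurmd -G

# Look for gres-related messages in the slurmd log (path assumed)
$ sudo grep -i gres /var/log/slurm/slurmd.log

# Once the detected count matches the configured count, clear the DRAIN state
$ sudo scontrol update NodeName=fabricnode2 State=RESUME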