[slurm-users] Can one specify attributes on a GRES resource?

Will Dennis wdennis at nec-labs.com
Fri Mar 22 02:39:50 UTC 2019


I tried doing this as follows:

Node's gres.conf:
##################################################################
# Slurm's Generic Resource (GRES) configuration file
##################################################################
Name=gpu File=/dev/nvidia0 Type=1050TI
Name=gpu_mem_per_card Count=4G
Name=gpu_cores_per_card Count=768

From slurm.conf:
NodeName=n75 CPUs=32 Gres=gpu:1,gpu_mem_per_card:no_consume:4G,gpu_cores_per_card:no_consume:768 Feature=GPUMODEL_1050TI

But when I restarted slurmctld, distributed the slurm.conf to the cluster nodes, and ran "scontrol reconfigure", the node went into a "DRAIN" state with the following reason:

[root@slurm-controller ~]# scontrol show node n75
NodeName=n75 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=32 CPULoad=0.01
   AvailableFeatures=GPUMODEL_1050TI
   ActiveFeatures=GPUMODEL_1050TI
   Gres=gpu:1,gpu_mem_per_card:no_consume:4G,gpu_cores_per_card:no_consume:768
   NodeAddr=n75 NodeHostName=n75 Version=16.05
   OS=Linux RealMemory=128905 AllocMem=0 FreeMem=124347 Sockets=32 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=240180 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2019-02-13T14:35:03 SlurmdStartTime=2019-02-13T15:07:27
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=gres/gpu_mem_per_card count too low (0 < 4294967296) [root@2019-03-21T22:23:59]  <<<<<<<<<<<<<<<


Why does it think that the "gres/gpu_mem_per_card" count is 0? How can I fix this?
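
For reference, the 4294967296 in the Reason field is the configured 4G expanded with a 1024-based multiplier, so the controller expects that count and the node reported 0. Below is a minimal shell sketch of that arithmetic, followed by two read-only checks that might help narrow things down. The expand_count helper is hypothetical (it is not a Slurm tool), and the grep patterns are only a guess at what is worth looking at: whether the new GRES names are listed in GresTypes, and whether SlurmdStartTime on n75 is older than the gres.conf change.

```shell
#!/bin/sh
# Expand a Slurm-style count suffix (K, M, G) using 1024-based multipliers.
# expand_count is a hypothetical helper for illustration, not part of Slurm.
expand_count() {
    num=${1%[KMG]}
    case $1 in
        *K) echo $((num * 1024)) ;;
        *M) echo $((num * 1024 * 1024)) ;;
        *G) echo $((num * 1024 * 1024 * 1024)) ;;
        *)  echo "$1" ;;
    esac
}

expected=$(expand_count 4G)
echo "expected gpu_mem_per_card count: $expected"

# Read-only checks; skipped on machines where scontrol is not installed.
if command -v scontrol >/dev/null 2>&1; then
    # Are the new GRES names known to the controller?
    scontrol show config | grep -i GresTypes
    # What does the controller currently hold for the node?
    scontrol show node n75 | grep -E 'Gres=|SlurmdStartTime=|Reason='
fi
```

None of this changes any state; it only prints what the controller currently knows about n75.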



-----Original Message-----
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Quirin Lohr
Sent: Wednesday, March 20, 2019 4:06 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Can one specify attributes on a GRES resource?

Hi Will,

I solved this by creating a new GRES:
Some nodes have VRAM:no_consume:12G
Some nodes have VRAM:no_consume:24G

I use "no_consume" because otherwise the requested amount would be deducted from a single count for the whole node.

This only works because each node has only one type of GPU.

It is then requested with --gres=gpu:1,VRAM:16G
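
As a rough sketch, the same request could be placed in a batch script like the one below. The job name is made up for illustration, and the script assumes the gres/gpu plugin exports CUDA_VISIBLE_DEVICES to the job, which may depend on how the GPUs are defined in gres.conf. With VRAM advertised as no_consume, a request for 16G should only match nodes that advertise at least that much (the 24G nodes here) without reducing what other jobs on the node can request.

```shell
#!/bin/bash
# Hypothetical job name, for illustration only.
#SBATCH --job-name=vram-demo
# One GPU, on a node that advertises at least 16G in the VRAM GRES.
#SBATCH --gres=gpu:1,VRAM:16G

# Report what the job was given; fall back to placeholders outside Slurm.
gpus=${CUDA_VISIBLE_DEVICES:-unset}
echo "Job ${SLURM_JOB_ID:-none} sees GPUs: ${gpus}"
```

It would be submitted with sbatch as usual; the same --gres string also works directly on the srun or salloc command line.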

Here is an extract of my slurm.conf:


> NodeName=node7  Gres=gpu:p6000:8,VRAM:no_consume:24G   Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=257843 Weight=10 Feature=p6000
> NodeName=node6  Gres=gpu:titanxpascal:8,VRAM:no_consume:12G Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=257854 Weight=1  Feature=titanxp

The cudacores could be implemented accordingly (which is a nice idea btw.).


Regards
Quirin


-- 
Quirin Lohr
Systemadministration
Technische Universität München
Fakultät für Informatik
Lehrstuhl für Bildverarbeitung und Mustererkennung

Boltzmannstrasse 3
85748 Garching

Tel. +49 89 289 17769
Fax +49 89 289 17757

quirin.lohr at in.tum.de
www.vision.in.tum.de



