[slurm-users] Query about Compute + GPUs

Thu Nov 30 03:58:08 MST 2017

On Tuesday, 21 November 2017 16:38:48 CET Ing. Gonzalo E. Arroyo wrote:
> I have a problem detecting RAM and Arch (maybe some more), check this...
> 
> NodeName=fisesta-21-3 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.01
>    AvailableFeatures=rack-21,2CPUs
>    ActiveFeatures=rack-21,2CPUs
>    Gres=gpu:1
>    NodeAddr=10.1.21.3 NodeHostName=fisesta-21-3 Version=16.05
>    OS=Linux RealMemory=3950 AllocMem=0 FreeMem=0 Sockets=2 Boards=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=259967 Weight=20479797 Owner=N/A
> MCS_label=N/A
>    BootTime=2017-10-30T16:39:22 SlurmdStartTime=2017-11-06T16:46:54
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> 
> NodeName=fisesta-21-3-cpus CoresPerSocket=1
>    CPUAlloc=0 CPUErr=0 CPUTot=6 CPULoad=0.01
>    AvailableFeatures=rack-21,6CPUs
>    ActiveFeatures=rack-21,6CPUs
>    Gres=(null)
>    NodeAddr=10.1.21.3 NodeHostName=fisesta-21-3-cpus Version=(null)
>    RealMemory=1 AllocMem=0 FreeMem=0 Sockets=6 Boards=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=20483797 Owner=N/A
> MCS_label=N/A
>    BootTime=None SlurmdStartTime=None
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
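
The two records above can be compared field by field. The following is only
a rough sketch in Python, using a shortened copy of the output; the helper
names (parse_node_record, suspicious_fields) are made up for illustration,
and treating RealMemory=1 as an "undetected" value is my assumption.

```python
import re

def parse_node_record(text):
    """Turn the Key=Value tokens of one node record into a dict of strings."""
    return dict(re.findall(r"(\w+)=(\S+)", text))

def suspicious_fields(record):
    """List fields that look undetected (heuristic, not a Slurm rule)."""
    issues = []
    if "Arch" not in record:
        issues.append("Arch")
    for key in ("Version", "BootTime"):
        if record.get(key) in ("(null)", "None"):
            issues.append(key)
    # Assumption: RealMemory=1 means the value was never detected or set.
    if record.get("RealMemory") == "1":
        issues.append("RealMemory")
    return issues

# Shortened copies of the two records quoted above.
gpu_view = parse_node_record(
    "NodeName=fisesta-21-3 Arch=x86_64 CoresPerSocket=1 CPUTot=2 "
    "Gres=gpu:1 Version=16.05 RealMemory=3950 Sockets=2 "
    "BootTime=2017-10-30T16:39:22"
)
cpu_view = parse_node_record(
    "NodeName=fisesta-21-3-cpus CoresPerSocket=1 CPUTot=6 "
    "Gres=(null) Version=(null) RealMemory=1 Sockets=6 BootTime=None"
)

print(gpu_view["NodeName"], suspicious_fields(gpu_view))
print(cpu_view["NodeName"], suspicious_fields(cpu_view))
```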

I also saw wrong values for Sockets, CPUs and Threads, but I did not notice
the wrong values for RAM. I therefore defined Sockets, CoresPerSocket,
ThreadsPerCore and RealMemory explicitly.
I had hoped that Slurm would somehow track the memory so that it is shared
between the partitions. I would rather not set a memory limit for both,
because depending on the user we need between 2 and 200 GB of RAM per GPU...

> For your problem, please share the important lines of nodes and partitions,
> you should check your users have permission to run inside every partition /
> node split by this new configuration

I already included these lines in my first mail:

NodeName=gpu1 NodeAddr=10.1.2.3 RealMemory=229376 Weight=998002  Sockets=2 
CoresPerSocket=3 ThreadsPerCore=2 Gres=gpu:TeslaK40c:6

NodeName=gpu1-cpu NodeAddr=10.1.2.3 RealMemory=229376 Weight=998002  Sockets=2 
CoresPerSocket=11 ThreadsPerCore=2

PartitionName=gpu Nodes=gpu1
PartitionName=cpu Nodes=gpu1-cpu

I get the following error if I submit to node gpu1-cpu:

[2017-11-21T09:06:55.840] launch task 999708.0 request from 1044.1000 at 10.1.2.3 (port 45252)
[2017-11-21T09:06:55.840] error: Invalid job 999708.0 credential for user 1044: host gpu1 not in hostset gpu1-cpu
[2017-11-21T09:06:55.840] error: Invalid job credential from 1044 at 10.1.2.3: Invalid job credential
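
As far as I can read the message, the host name reported by the daemon
(gpu1) is not part of the host set the job credential was issued for
(gpu1-cpu). The Python sketch below only pulls those fields out of the log
line to make the mismatch explicit. It is not how slurmd performs the check;
the function name explain is made up, and splitting the hostset on commas is
a simplification that does not expand bracket ranges such as node[1-4].

```python
import re

ERROR_LINE = (
    "[2017-11-21T09:06:55.840] error: Invalid job 999708.0 credential "
    "for user 1044: host gpu1 not in hostset gpu1-cpu"
)

PATTERN = re.compile(
    r"Invalid job (?P<step>\S+) credential for user (?P<uid>\d+): "
    r"host (?P<host>\S+) not in hostset (?P<hostset>\S+)"
)

def explain(line):
    """Extract job step, uid, host and hostset from the error line."""
    match = PATTERN.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Simplification: treat the hostset as a plain comma-separated list.
    info["host_in_hostset"] = info["host"] in info["hostset"].split(",")
    return info

info = explain(ERROR_LINE)
print(info)
```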

The GPU node has 2 sockets with 14 cores each and 2 threads per core, 256 GB
of RAM, and 6 Tesla K40c GPUs.
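
To double-check how the two node definitions divide the hardware, here is a
small Python sketch of the arithmetic. The class NodeDef and the variable
names are made up for illustration, and I assume that 256 GB means
256 * 1024 MB. The CPU counts add up to the physical machine, but because
both definitions carry the full RealMemory, the memory is declared twice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeDef:
    name: str
    sockets: int
    cores_per_socket: int
    threads_per_core: int
    real_memory_mb: int

    @property
    def cpus(self):
        # Logical CPUs = sockets * cores per socket * threads per core.
        return self.sockets * self.cores_per_socket * self.threads_per_core

# Physical machine as described (assumption: 256 GB = 256 * 1024 MB).
physical = NodeDef("physical", 2, 14, 2, 256 * 1024)
# The two node definitions from the configuration lines.
gpu1 = NodeDef("gpu1", 2, 3, 2, 229376)
gpu1_cpu = NodeDef("gpu1-cpu", 2, 11, 2, 229376)

declared_cpus = gpu1.cpus + gpu1_cpu.cpus
declared_mem_mb = gpu1.real_memory_mb + gpu1_cpu.real_memory_mb

print("CPUs declared:", declared_cpus, "physical:", physical.cpus)
print("Memory declared (MB):", declared_mem_mb,
      "physical (MB):", physical.real_memory_mb)
```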

I will investigate further the next time no jobs are running. I am unsure
what I can change without killing jobs. I have already learned that renaming
a partition or removing a node from a partition kills jobs :-(

Any suggestions on what I should look for?

regards
Markus
-- 
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeberl at tugraz.at