[slurm-users] Query about Compute + GPUs
Markus Köberl
markus.koeberl at tugraz.at
Thu Nov 30 03:58:08 MST 2017
On Tuesday, 21 November 2017 16:38:48 CET Ing. Gonzalo E. Arroyo wrote:
> I have a problem detecting RAM and Arch (maybe some more), check this...
>
> NodeName=fisesta-21-3 Arch=x86_64 CoresPerSocket=1
> CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.01
> AvailableFeatures=rack-21,2CPUs
> ActiveFeatures=rack-21,2CPUs
> Gres=gpu:1
> NodeAddr=10.1.21.3 NodeHostName=fisesta-21-3 Version=16.05
> OS=Linux RealMemory=3950 AllocMem=0 FreeMem=0 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=259967 Weight=20479797 Owner=N/A
> MCS_label=N/A
> BootTime=2017-10-30T16:39:22 SlurmdStartTime=2017-11-06T16:46:54
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> NodeName=fisesta-21-3-cpus CoresPerSocket=1
> CPUAlloc=0 CPUErr=0 CPUTot=6 CPULoad=0.01
> AvailableFeatures=rack-21,6CPUs
> ActiveFeatures=rack-21,6CPUs
> Gres=(null)
> NodeAddr=10.1.21.3 NodeHostName=fisesta-21-3-cpus Version=(null)
> RealMemory=1 AllocMem=0 FreeMem=0 Sockets=6 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=20483797 Owner=N/A
> MCS_label=N/A
> BootTime=None SlurmdStartTime=None
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
I also saw wrong values for Sockets, CPUs and Threads, but I did not notice
the wrong values for RAM. That is why I defined Sockets, CoresPerSocket,
ThreadsPerCore and RealMemory explicitly.
I had hoped that Slurm would somehow track the memory so that it is shared
between the partitions. I do not want to set a limit for both, because
depending on the user we need between 2 and 200 GB RAM per GPU...
> For your problem, please share the important lines of nodes and partitions,
> you should check your users have permission to run inside every partition /
> node split by this new configuration
I already included these lines in my first mail:
NodeName=gpu1 NodeAddr=10.1.2.3 RealMemory=229376 Weight=998002 Sockets=2 CoresPerSocket=3 ThreadsPerCore=2 Gres=gpu:TeslaK40c:6
NodeName=gpu1-cpu NodeAddr=10.1.2.3 RealMemory=229376 Weight=998002 Sockets=2 CoresPerSocket=11 ThreadsPerCore=2
PartitionName=gpu Nodes=gpu1
PartitionName=cpu Nodes=gpu1-cpu
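The node definitions above can be checked mechanically. The following is a minimal Python sketch, assuming each definition sits on one logical line of whitespace-separated Key=Value pairs. The helper names parse_node_line and nodes_by_addr are made up for illustration and are not part of Slurm. It shows that both node names point at the same address:

```python
from collections import defaultdict

# Node definitions as quoted above, each joined onto one logical line.
CONF_LINES = [
    "NodeName=gpu1 NodeAddr=10.1.2.3 RealMemory=229376 Weight=998002 "
    "Sockets=2 CoresPerSocket=3 ThreadsPerCore=2 Gres=gpu:TeslaK40c:6",
    "NodeName=gpu1-cpu NodeAddr=10.1.2.3 RealMemory=229376 Weight=998002 "
    "Sockets=2 CoresPerSocket=11 ThreadsPerCore=2",
]


def parse_node_line(line):
    """Split one NodeName=... line into a dict of its Key=Value pairs."""
    return dict(field.split("=", 1) for field in line.split())


def nodes_by_addr(lines):
    """Group node names by NodeAddr (hypothetical helper, not a Slurm API)."""
    groups = defaultdict(list)
    for line in lines:
        node = parse_node_line(line)
        groups[node["NodeAddr"]].append(node["NodeName"])
    return dict(groups)


# Keep only addresses that are claimed by more than one node name.
shared = {addr: names for addr, names in nodes_by_addr(CONF_LINES).items()
          if len(names) > 1}
print(shared)  # {'10.1.2.3': ['gpu1', 'gpu1-cpu']}
```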
I get the following error if I submit to node gpu1-cpu:
[2017-11-21T09:06:55.840] launch task 999708.0 request from 1044.1000 at 10.1.2.3 (port 45252)
[2017-11-21T09:06:55.840] error: Invalid job 999708.0 credential for user 1044: host gpu1 not in hostset gpu1-cpu
[2017-11-21T09:06:55.840] error: Invalid job credential from 1044 at 10.1.2.3: Invalid job credential
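The error names two things: the host (gpu1) and the host set the job credential was issued for (gpu1-cpu). To pull both out of a log for comparison, here is a small Python sketch. The regular expression is an assumption fitted to this one message, and other Slurm versions may word it differently:

```python
import re

# The slurmd error quoted above, joined onto one line.
LOG_LINE = ("[2017-11-21T09:06:55.840] error: Invalid job 999708.0 credential "
            "for user 1044: host gpu1 not in hostset gpu1-cpu")

# Pattern written for this message only (an assumption, not a Slurm-defined format).
PATTERN = re.compile(r"host (?P<host>\S+) not in hostset (?P<hostset>\S+)")

match = PATTERN.search(LOG_LINE)
host, hostset = match.group("host"), match.group("hostset")
print(host, hostset)  # gpu1 gpu1-cpu
```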
The node gpu1 has 2 sockets with 14 cores each, 2 threads per core, 256 GB RAM
and 6 Tesla K40c GPUs.
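As a sanity check on the numbers, here is a short Python sketch using only the values from the node definitions and the hardware description above. The function name logical_cpus is made up for illustration. It assumes that "256GB" means 256 * 1024 MB:

```python
def logical_cpus(sockets, cores_per_socket, threads_per_core):
    """Logical CPUs implied by a Sockets/CoresPerSocket/ThreadsPerCore triple."""
    return sockets * cores_per_socket * threads_per_core


# Values taken from the node definitions and the hardware description above.
physical = logical_cpus(2, 14, 2)
gpu1 = logical_cpus(2, 3, 2)
gpu1_cpu = logical_cpus(2, 11, 2)

print(gpu1, gpu1_cpu, physical)     # 12 44 56
print(gpu1 + gpu1_cpu == physical)  # True

# RealMemory is given in megabytes; both definitions advertise the same amount.
real_memory_mb = 229376
physical_mb = 256 * 1024            # assumption: "256GB" means 256 * 1024 MB
print(2 * real_memory_mb, physical_mb)  # 458752 262144
```

So the CPU split adds up to the physical machine, but the two RealMemory values together exceed the installed RAM, which is why memory would be over-committed if the two definitions are accounted independently.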
I will investigate further the next time no jobs are running, since I am
unsure what I can change without killing jobs. I have already learned that
renaming a partition or removing a node from a partition kills jobs :-(
Any suggestions on what I should look for?
regards
Markus
--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeberl at tugraz.at