[slurm-users] DefMemPerGPU bug?
Bas van der Vlies
bas.vandervlies at surfsara.nl
Mon Mar 30 10:48:14 UTC 2020
We have the same issue, see:
* https://bugs.schedmd.com/show_bug.cgi?id=8527
* As a temporary fix we switched back to DefMemPerCPU.
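For reference, a minimal sketch of that workaround in slurm.conf; the partition line mirrors the one quoted below, and the per-CPU value is chosen so that, with DefCpuPerGPU=20, the default works out to the same 125000 MB per GPU:

```
# Workaround sketch: use a per-CPU default instead of DefMemPerGPU.
# With DefCpuPerGPU=20, a job gets 20 * 6250 = 125000 MB per GPU,
# matching the intended DefMemPerGPU=125000.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PartitionName=p100 Nodes=ucs480 DefCpuPerGPU=20 DefMemPerCPU=6250 Default=YES MaxTime=INFINITE State=UP
```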
regards
On 26/03/2020 16:42, Wayne Hendricks wrote:
> When using 20.02/cons_tres and defining DefMemPerGPU, jobs submitted
> that request GPUs without specifying "--mem" will not run more than one
> job per node. I can see that the correct amount of memory is allocated
> for the job per GPU requested, but no other jobs will run on the node.
> If a value for "--mem" is given, other jobs will share the node. Is
> this the expected behavior? I understand that when a job does not
> request memory it is assumed to be using the whole node, but here,
> when we ask for GPUs, a default memory is set with DefMemPerGPU, and
> it seems this is not being taken into account. Let me know if there is
> a reason for this behavior or if there is another way to set the
> default job memory.
>
> Config:
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
> PartitionName=p100 Nodes=ucs480 OverSubscribe=FORCE:4 DefCpuPerGPU=20
> DefMemPerGPU=125000 Default=YES MaxTime=INFINITE State=UP
>
> Node and job state when two jobs submitted with each requesting half the
> GPUs (no "--mem" specified):
>
> CfgTRES=cpu=80,mem=500000M,billing=80
> AllocTRES=cpu=40,mem=250000M
>
> Job state:
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 872 p100 test-s6 wayne.he PD 0:00 1 (Resources)
> 871 p100 test-s5 wayne.he R 0:03 1 ucs480
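As noted above, specifying memory explicitly lets jobs share the node. A hypothetical submission for the quoted setup (the script name is a placeholder; 250000 MB matches the per-job allocation shown in AllocTRES):

```
# Request 2 GPUs and set --mem explicitly so cons_tres does not
# reserve the node's remaining memory for this job.
sbatch --gres=gpu:2 --mem=250000 test.sh
```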
--
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG
Amsterdam
| T +31 (0) 20 800 1300 | bas.vandervlies at surfsara.nl | www.surfsara.nl |