[slurm-users] DefMemPerGPU bug?
Wayne Hendricks
waynehendricks at gmail.com
Thu Mar 26 15:42:20 UTC 2020
When using 20.02/cons_tres with DefMemPerGPU defined, jobs that request GPUs without specifying "--mem" will not run more than one per node. I can see that the correct amount of memory is being allocated per job for the GPUs requested, but no other jobs will run on the node. If a value for "--mem" is specified, other jobs will share the node. Is this the expected behavior?

I understand that when a job does not request memory it is assumed to need the whole node, but here, since the job requests GPUs, a default memory is set via DefMemPerGPU, and it seems this is not being taken into account. Let me know if there is a reason for this behavior, or if there is another way to set the default job memory.
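To reproduce (a minimal sketch; the gres count and the test.sh batch script are placeholders):

  # Without --mem: the second job stays pending with (Resources)
  sbatch --gres=gpu:2 test.sh
  sbatch --gres=gpu:2 test.sh

  # With an explicit --mem, the two jobs share the node
  sbatch --gres=gpu:2 --mem=250000 test.sh
  sbatch --gres=gpu:2 --mem=250000 test.sh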
Config:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
PartitionName=p100 Nodes=ucs480 OverSubscribe=FORCE:4 DefCpuPerGPU=20 DefMemPerGPU=125000 Default=YES MaxTime=INFINITE State=UP
Node and job state when two jobs are submitted, each requesting half the GPUs (no --mem specified):
CfgTRES=cpu=80,mem=500000M,billing=80
AllocTRES=cpu=40,mem=250000M
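The allocated amounts match the per-GPU defaults for the single running job: assuming the node has 4 GPUs, a half-node request is 2 GPUs, so 2 x DefCpuPerGPU (20) = 40 CPUs and 2 x DefMemPerGPU (125000M) = 250000M. Half the node's CPUs and memory remain free, yet the second job is held on (Resources).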
Job state:
 JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
   872      p100 test-s6 wayne.he PD  0:00     1 (Resources)
   871      p100 test-s5 wayne.he  R  0:03     1 ucs480
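One possible workaround (a sketch; not verified to change the scheduling outcome here): pass the per-GPU memory explicitly at submission time with --mem-per-gpu, so the job carries an explicit memory request rather than relying on the partition default:

  sbatch --gres=gpu:2 --mem-per-gpu=125000 test.sh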