[slurm-users] memory limit questions
Paul Raines
raines at nmr.mgh.harvard.edu
Fri Jul 13 10:12:39 MDT 2018
I am trying Slurm for the first time on test machines, version 17.02.7
on CentOS 7.5 boxes.
Relevant lines from my slurm.conf:
ProctrackType=proctrack/cgroup
SwitchType=switch/none
PropagateResourceLimitsExcept=MEMLOCK
TaskPlugin=task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightPartition=10000
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
NodeName=icenode00 Procs=6 State=UNKNOWN RealMemory=62000
...
NodeName=icenode05 Procs=6 State=UNKNOWN RealMemory=62000
PartitionName=nmrdef Nodes=icenode[00,02,03,04,05] Default=YES MaxTime=7-00:00:00 DefaultTime=3-00:00:00 State=UP PriorityJobFactor=1000 LLN=Yes
PartitionName=p6 Nodes=icenode[00,02,03,04,05] MaxTime=7-00:00:00 DefaultTime=3-00:00:00 State=UP PriorityJobFactor=6000 LLN=Yes
And cgroup.conf has:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=95.0
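(As I read the cgroup.conf man page, AllowedRAMSpace=95.0 should mean the
RAM cgroup limit gets set to 95% of the memory allocated to the job, so a
--mem=100M job would end up with a limit of roughly 95 MB -- please correct
me if I have that wrong.)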
I submit my first "real" job like this, and it runs fine:
$ sbatch -N 1 --ntasks-per-node=1 -c 4 --mem=20G -p p6 -o test1.out mriqctest.sh
Submitted batch job 23
$ sacct -j 23 -o reqmem,maxvmsize,maxrss,exitcode,totalCPU,elapsed
    ReqMem  MaxVMSize     MaxRSS ExitCode   TotalCPU    Elapsed
---------- ---------- ---------- -------- ---------- ----------
      20Gn                            0:0  00:15.254   00:01:18
      20Gn    140504K    357712K      0:0  00:15.254   00:01:18
I don't understand why I get these two different lines from sacct, but
whatever. What confuses me more is how MaxRSS can be greater than
MaxVMSize. I am guessing it is just a sampling/timing issue, since the
job ran for less than two minutes.
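(Presumably one of those two lines is the overall job record and the other
is the batch step. Adding the JobID column, roughly as below, should show
which is which, though I have not chased it down.)

$ sacct -j 23 -o jobid,reqmem,maxvmsize,maxrss,exitcode,totalCPU,elapsed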
Anyway, I want to test the memory limit constraint, so I next submit with:
$ sbatch -N 1 --ntasks-per-node=1 -c 1 --mem=100M -p p6 -o test2.out mriqctest.sh
Submitted batch job 24
Running top shows the mriqc process using a ton of CPU for over
15 minutes with zero output to test2.out. Finally I kill it:
[root at icestorm ~]# sacct -j 24 -o reqmem,maxvmsize,maxrss,exitcode,totalCPU,elapsed
    ReqMem  MaxVMSize     MaxRSS ExitCode   TotalCPU    Elapsed
---------- ---------- ---------- -------- ---------- ----------
     100Mn                           15:0  19:41.284   00:19:43
     100Mn    140504K     45656K     15:0  19:41.284   00:19:43
At that point I see this in test2.out:
/var/spool/slurm/d/job00024/slurm_script: line 15: 8854 Terminated
singularity exec -B $PWD/ds008_R2.0.0:/data:ro -B $PWD/out$1:/out
/usr/pubsw/packages/mriqc/current/mriqc.simg mriqc --no-sub /data /out
participant --participant_label sub-15
slurmstepd: error: Exceeded step memory limit at some point.
So my question is: why was the job not killed for exceeding its memory
limit, and why did it burn CPU seemingly forever like it did?
I don't have the source for the process I am running, but could it be
that some loop keeps retrying a malloc() that fails instead of failing
cleanly? And if that were the case, why would slurmstepd report that the
step memory limit was exceeded, and, if it saw that, why did it not kill
the process?
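For what it is worth, a check along these lines on the compute node, while
such a job is still spinning, should at least show whether the cgroup limit
is in place and being hit (paths assume cgroup v1 and the stock Slurm cgroup
hierarchy on CentOS 7; the uid and job id below are placeholders):

# run on the compute node while the job is still running
cd /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>
cat memory.limit_in_bytes      # limit the task/cgroup plugin actually set
cat memory.max_usage_in_bytes  # high-water mark of the job's memory use
cat memory.failcnt             # how many times the limit was hit
cat memory.oom_control         # whether the OOM killer is disabled here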
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA