<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><body>What is the change in the commit you're thinking about?<div><br></div><div style="font-size:100%;color:#000000"><!-- originalMessage --><div>-------- Original message --------</div><div>From: Taras Shapovalov &lt;taras.shapovalov@brightcomputing.com&gt; </div><div>Date: 10/07/2018 19:34 (GMT+01:00) </div><div>To: slurm-users@schedmd.com </div><div>Subject: [slurm-users] DefMemPerCPU is reset to 1 after upgrade </div><div><br></div></div><div dir="ltr"><span style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Hey guys,</span><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><span style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">After we upgraded to 17.11.7, all jobs on some clusters are being killed with these messages:</span><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><span style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"> slurmstepd: error: Job 374 exceeded memory limit (1308 &gt; 1024), being killed</span><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><span style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"> slurmstepd: error: Exceeded job memory limit</span><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><span style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"> slurmstepd: error: *** JOB 374 ON node002 CANCELLED AT 2018-06-28T04:40:28 ***</span><br 
style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><div><span style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">However, DefMemPerCPU and DefMemPerNode are both set to UNLIMITED, and MemLimitEnforce=YES. Users did not set memory limits for their jobs. The error messages above suggest that DefMemPerCPU is somehow being reset to 1 (my guess).</span><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><br style="font-size:small;text-decoration-style:initial;text-decoration-color:initial"><div style="font-size:small;text-decoration-style:initial;text-decoration-color:initial">I noticed a commit that may be related to this:<br><br><a href="https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e" target="_blank" style="color:rgb(17,85,204)">https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e</a><br><br>What do you think?<br><br><br>Best regards,<br><br>Taras</div><br></div></div>
</body></html>