[slurm-users] DefMemPerCPU is reset to 1 after upgrade

Taras Shapovalov taras.shapovalov at brightcomputing.com
Tue Jul 10 11:34:37 MDT 2018


Hey guys,

After we upgraded to 17.11.7, all jobs on some clusters are being killed with
these messages:

  slurmstepd: error: Job 374 exceeded memory limit (1308 > 1024), being killed
  slurmstepd: error: Exceeded job memory limit
  slurmstepd: error: *** JOB 374 ON node002 CANCELLED AT 2018-06-28T04:40:28 ***

Both DefMemPerCPU and DefMemPerNode are set to UNLIMITED, and
MemLimitEnforce=YES. Users did not set memory limits for their jobs. The
limit of 1024 in the first message (1 MB, if that value is in KB) suggests
that DefMemPerCPU was somehow reset to 1 after the upgrade, though that is
only my guess.
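
To make the arithmetic behind that guess explicit, here is a minimal Python
sketch that pulls the numbers out of the first slurmstepd message and converts
the limit to megabytes. The regex and the parse_limit helper are my own and
not part of Slurm, and the conversion assumes the values in the message are
in kilobytes, which I have not verified against the source.

```python
import re

# Hypothetical pattern for the slurmstepd message quoted above (not a Slurm API).
LIMIT_RE = re.compile(r"Job (\d+) exceeded memory limit \((\d+) > (\d+)\)")


def parse_limit(line):
    """Return (job_id, used, limit) as ints, or None if the line does not match."""
    m = LIMIT_RE.search(line)
    if m is None:
        return None
    job_id, used, limit = (int(g) for g in m.groups())
    return job_id, used, limit


line = "slurmstepd: error: Job 374 exceeded memory limit (1308 > 1024), being killed"
job_id, used_kb, limit_kb = parse_limit(line)

# Assumption: both numbers are kilobytes, so divide by 1024 to get megabytes.
limit_mb = limit_kb / 1024
print(f"job {job_id}: used {used_kb} KB, limit {limit_kb} KB = {limit_mb} MB")
```

Under that assumption the enforced limit works out to 1 MB, which is what a
per-CPU default of 1 would give a single-CPU job.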

I noticed a commit that may be related to this:

https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e

What do you think?


Best regards,

Taras