[slurm-users] DefMemPerCPU is reset to 1 after upgrade
Roberts, John E.
jeroberts at anl.gov
Tue Jul 10 11:44:21 MDT 2018
Hi,
I ran into this recently after upgrading from 16.05.10 to 17.11.7 and couldn’t run any jobs on any partitions. The only way I got around it was to set RealMemory=<foo> on every “NodeName” definition in slurm.conf, where foo is the total memory of the node in MB. I believe the documentation says to keep this number a bit under the total memory, but I’ve set mine to the max and things have been running fine.
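For anyone else hitting this, the line ends up looking roughly like the sketch below (the node names, CPU count, and memory size are just placeholders, not my actual hardware). Running “slurmd -C” on a compute node prints the CPUs/RealMemory values that Slurm detects, which you can paste into slurm.conf:

    # slurm.conf (hypothetical nodes with 64 GB of RAM)
    NodeName=node[001-016] CPUs=16 RealMemory=64000 State=UNKNOWN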
Thanks!
John
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Taras Shapovalov <taras.shapovalov at brightcomputing.com>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Tuesday, July 10, 2018 at 12:38 PM
To: "slurm-users at schedmd.com" <slurm-users at schedmd.com>
Subject: [slurm-users] DefMemPerCPU is reset to 1 after upgrade
Hey guys,
After we upgraded to 17.11.7, all jobs on some clusters are killed with these messages:
slurmstepd: error: Job 374 exceeded memory limit (1308 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 374 ON node002 CANCELLED AT 2018-06-28T04:40:28 ***
The thing is that DefMemPerCPU and DefMemPerNode are set to UNLIMITED and MemLimitEnforce=YES. Users did not set memory limits for their jobs. The error messages above suggest that DefMemPerCPU is somehow being reset to 1 (my guess).
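(For reference, the effective values reported by the controller can be double-checked with something like the following; the grep pattern is just illustrative:

    scontrol show config | grep -iE 'mempercpu|mempernode|memlimitenforce'
)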
I noticed the commit that can be related to this:
https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e
What do you think?
Best regards,
Taras