[slurm-users] DefMemPerCPU is reset to 1 after upgrade
stolarek.marcin
stolarek.marcin at gmail.com
Tue Jul 10 12:44:50 MDT 2018
What is the change in the commit you're thinking about?
-------- Original message --------
From: Taras Shapovalov <taras.shapovalov at brightcomputing.com>
Date: 10/07/2018 19:34 (GMT+01:00)
To: slurm-users at schedmd.com
Subject: [slurm-users] DefMemPerCPU is reset to 1 after upgrade
Hey guys,
After we upgraded to 17.11.7, all jobs on some clusters are being killed with these messages:
slurmstepd: error: Job 374 exceeded memory limit (1308 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 374 ON node002 CANCELLED AT 2018-06-28T04:40:28 ***
DefMemPerCPU and DefMemPerNode are both set to UNLIMITED, and MemLimitEnforce=YES. Users did not set memory limits for their jobs. The error messages above suggest that DefMemPerCPU is somehow being reset to 1 (my guess).
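One way to confirm what the controller actually loaded after the upgrade is to read the values back from "scontrol show config" rather than from slurm.conf. Below is a minimal Python sketch, assuming the output consists of "Key = Value" lines. The helper names (parse_config, memory_settings, read_live_config) are hypothetical, and a key that is not printed is reported as such rather than guessed.

```python
import re
import subprocess

# Settings mentioned in this thread.
KEYS = ("DefMemPerCPU", "DefMemPerNode", "MemLimitEnforce")


def parse_config(text):
    """Parse 'Key = Value' lines into a dict.

    Assumption: each setting is printed on its own line as 'Key = Value'.
    Lines without '=' (such as a header line) are ignored.
    """
    config = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(\S+)\s*=\s*(.*?)\s*$", line)
        if match:
            config[match.group(1)] = match.group(2)
    return config


def memory_settings(text):
    """Return only the memory-related keys, marking any that are absent."""
    config = parse_config(text)
    return {key: config.get(key, "<not reported>") for key in KEYS}


def read_live_config():
    """Run 'scontrol show config' and return its output (requires Slurm)."""
    result = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a cluster node, memory_settings(read_live_config()) would show whether the running configuration still reports UNLIMITED or whether a different value was picked up after the upgrade.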
I noticed a commit that may be related to this:
https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e
What do you think?
Best regards,
Taras