[slurm-users] Spurious OOM-kills with cgroups on 20.11.8?
rmoye at quantlab.com
Tue Aug 10 21:11:22 UTC 2021
Do you know if the job is actually being killed? We had an issue on an older version of Slurm where we got OOM errors but the tasks actually completed. The OOM error was reported when the job exited and was a false alarm.
Also, there are several bug reports open right now about an issue similar to what you have described. You can go to bugs.schedmd.com to look at those bug reports.
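One way to check whether a job was really killed is to compare the accounting state of each job step with the job's actual output. The following is a minimal sketch, assuming sacct is available and job accounting is enabled. The helper names and the sample output are hypothetical and for illustration only; real sacct output will differ by site and version.

```python
import subprocess

# Accounting fields requested from sacct.
SACCT_FIELDS = "JobID,State,ExitCode,ReqMem,MaxRSS"


def parse_sacct(text):
    """Parse `sacct --parsable2` output (pipe-delimited, header line first)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def oom_steps(rows):
    """Return the rows whose State reports an out-of-memory event."""
    return [r for r in rows if r.get("State", "").startswith("OUT_OF_MEMORY")]


def query_job(job_id):
    """Run sacct for one job and return the parsed rows (needs a Slurm host)."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--parsable2", "--format=" + SACCT_FIELDS],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_sacct(out)


# Hypothetical sample output, made up for illustration.
SAMPLE = """JobID|State|ExitCode|ReqMem|MaxRSS
12345|OUT_OF_MEMORY|0:125|4Gn|
12345.batch|OUT_OF_MEMORY|0:125|4Gn|1200K
12345.extern|COMPLETED|0:0|4Gn|0
"""

if __name__ == "__main__":
    for row in oom_steps(parse_sacct(SAMPLE)):
        print(row["JobID"], row["State"], row["MaxRSS"])
```

If a step is flagged as out of memory but its output files are complete and its MaxRSS is far below the requested memory, that points toward a false report rather than a real kill.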
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Caron
Sent: Tuesday, August 10, 2021 4:01 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>; Sean Caron <scaron at umich.edu>
Subject: [slurm-users] Spurious OOM-kills with cgroups on 20.11.8?
Has anyone else observed jobs getting OOM-killed in 20.11.8 with cgroups that ran fine in previous versions like 20.10?
I've had a few reports from users after upgrading maybe six weeks ago that their jobs are getting OOM-killed when they haven't changed anything and the job ran to completion in the past with the same memory specification.
The most recent report I received today involved a job running a "cp" command getting OOM-killed. I have a hard time believing "cp" uses very much memory...
These machines are running various 5.4.x or 5.3.x Linux kernels.
I've had really good luck with the cgroups OOM-killer over the last few years in keeping my nodes from getting overwhelmed by runaway jobs. I'd hate to have to disable it just to clean up these weird issues.
My cgroup.conf file looks like the following:
Should I maybe bump AllowedRAMSpace? I don't see how this is any different from just asking the user to re-run the job with a larger memory allocation request. And that doesn't explain why jobs suddenly need more memory before getting OOM-killed than they used to.
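To make the AllowedRAMSpace comparison concrete: the setting is a percentage of the job's allocated memory, so raising it has the same effect on the RAM limit as a proportionally larger memory request. The following is a rough sketch of that arithmetic, assuming ConstrainRAMSpace=yes and ignoring swap and any other limits. The function name is hypothetical and the numbers are illustrative, not taken from the cgroup.conf in question.

```python
def effective_ram_limit_mb(requested_mb, allowed_ram_space_pct=100.0):
    """Approximate cgroup RAM limit for a job, in MB.

    Assumes the limit is simply the allocated memory scaled by the
    AllowedRAMSpace percentage (simplified; other settings are ignored).
    """
    return requested_mb * allowed_ram_space_pct / 100.0


if __name__ == "__main__":
    # A 4096 MB request with AllowedRAMSpace=110 ...
    bumped = effective_ram_limit_mb(4096, 110)
    # ... gives the same limit as requesting 10% more memory at 100%.
    larger_request = effective_ram_limit_mb(4096 * 1.10, 100)
    print(bumped, larger_request)
```

Either way the job gets the same ceiling, which is why bumping the percentage would only mask the change in behavior rather than explain it.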