[slurm-users] virtual memory limit exceeded

Noam Bernstein noam.bernstein at nrl.navy.mil
Thu Nov 8 20:16:48 MST 2018


Can anyone shed some light on where the _virtual_ memory limit comes from?  We're getting jobs killed with the message:
slurmstepd: error: Step 3664.0 exceeded virtual memory limit (79348101120 > 72638634393), being killed
Is this a limit that's dictated by cgroup.conf or by some srun option (like --mem-per-cpu)?  And where could this number come from on a machine that has 64 GB nodes, where DefMemPerCPU for the partition is 64 GB / 32 (threads) and cgroup.conf has AllowedSwapSpace=75?
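
For reference, here is a rough sketch of the arithmetic behind the question, converting the two byte counts from the error message to GiB and comparing them with the configured values. It assumes that "64 GB" means exactly 64 GiB per node (the usable figure may well be lower) and that the cgroup RAM+swap ceiling is the allocation times (1 + AllowedSwapSpace/100); the variable names are only illustrative.

```python
# Byte counts copied from the slurmstepd error message.
used_bytes = 79348101120
limit_bytes = 72638634393

GIB = 1024 ** 3

# Assumption: "64 GB" is taken as exactly 64 GiB of RAM per node.
node_bytes = 64 * GIB
threads = 32
allowed_swap_percent = 75  # AllowedSwapSpace from cgroup.conf

# DefMemPerCPU as described: node memory divided by the thread count.
def_mem_per_cpu = node_bytes // threads

# Assumption: RAM+swap ceiling = allocation * (1 + AllowedSwapSpace / 100).
cgroup_ram_plus_swap = node_bytes * (100 + allowed_swap_percent) // 100

print(f"used                 = {used_bytes / GIB:.2f} GiB")
print(f"reported limit       = {limit_bytes / GIB:.2f} GiB")
print(f"node memory          = {node_bytes / GIB:.2f} GiB")
print(f"DefMemPerCPU         = {def_mem_per_cpu / GIB:.2f} GiB")
print(f"cgroup RAM+swap cap  = {cgroup_ram_plus_swap / GIB:.2f} GiB")
print(f"limit / node memory  = {limit_bytes / node_bytes:.4f}")
```

Under these assumptions the reported limit is larger than the node's memory but well below the RAM+swap ceiling implied by AllowedSwapSpace=75, which is why it does not obviously match either number.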

And a couple of related questions:
1. If I define DefMemPerCPU in the partition line, and the job doesn't request anything else, what memory measure should I expect this to be the limit on? RSS?

2. In general, what's the right way to disable swapping by default, but allow individual jobs to request to be allowed to swap?

									thanks,
									Noam

