[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
Chin,David
dwc62 at drexel.edu
Mon Mar 15 18:15:42 UTC 2021
Here's seff output, if it makes any difference. In any case, the exact same job was run by the user on their laptop with 16 GB RAM with no problem.
Job ID: 83387
Cluster: picotte
User/Group: foob/foob
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Efficiency: 11.96% of 2-09:10:56 core-walltime
Job Wall-clock time: 03:34:26
Memory Utilized: 1.54 GB
Memory Efficiency: 1.21% of 128.00 GB
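(For reference, seff is the job-efficiency summary script from Slurm's contribs; the output above comes from an invocation along the lines of

    seff 83387

run after the job has finished.)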
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dwc62 at drexel.edu 215.571.4335 (o)
For URCF support: urcf-support at drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Paul Edmon <pedmon at cfa.harvard.edu>
Sent: Monday, March 15, 2021 14:02
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
One should keep in mind that sacct results for memory usage are not accurate for Out-Of-Memory (OOM) jobs. This is because the job is typically terminated before the next accounting polling interval, and before it reaches its full memory allocation. Thus I wouldn't trust any of the memory-usage results when a job is terminated by OOM. sacct simply can't pick up a sudden memory spike like that, and even if it did, it would not record the true peak, because the job was killed before reaching it.
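If you want to see what actually happened, the kernel is the authoritative source. A minimal sketch, assuming a task/cgroup setup with the cgroup v1 memory controller and the default Slurm cgroup hierarchy (the job ID here is illustrative), run as root on the compute node:

    # The kernel log records the actual OOM kill, including the task's RSS at kill time
    dmesg -T | grep -i -e oom -e 'killed process'

    # While the job's cgroup still exists, it exposes the memory high-water mark and OOM counters
    cat /sys/fs/cgroup/memory/slurm/uid_*/job_83387/memory.max_usage_in_bytes
    cat /sys/fs/cgroup/memory/slurm/uid_*/job_83387/memory.failcnt

Raising JobAcctGatherFrequency in slurm.conf narrows the sampling window, but it can never close it entirely.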
-Paul Edmon-
On 3/15/2021 1:52 PM, Chin,David wrote:
Hi, all:
I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: the job was a Matlab script, and its output was incomplete.
Here's sacct output:
JobID                JobName    User      Partition  NodeList        Elapsed    State      ExitCode ReqMem     MaxRSS     MaxVMSize  AllocTRES                        AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- -------------------------------- --------
83387                ProdEmisI+ foob      def        node001         03:34:26   OUT_OF_ME+ 0:125    128Gn                            billing=16,cpu=16,node=1
83387.batch          batch                           node001         03:34:26   OUT_OF_ME+ 0:125    128Gn      1617705K   7880672K   cpu=16,mem=0,node=1
83387.extern         extern                          node001         03:34:26   COMPLETED  0:0      128Gn      460K       153196K    billing=16,cpu=16,node=1
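The table above is sacct output with an explicit format list; judging from the column headers, the invocation was along these lines (the "AllocGRE" header is the AllocGRES field truncated to its column width):

    sacct -j 83387 --format=JobID,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocTRES,AllocGRES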
Thanks in advance,
Dave
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dwc62 at drexel.edu 215.571.4335 (o)
For URCF support: urcf-support at drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode