[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

Paul Edmon pedmon at cfa.harvard.edu
Mon Mar 15 18:02:54 UTC 2021


Keep in mind that the memory usage figures sacct reports are not
accurate for Out Of Memory (OOM) jobs.  The job is typically terminated
before the next sacct polling period, and also before it reaches its
full memory allocation.  So I wouldn't trust any of the memory usage
results if the job was terminated by OOM.  sacct just can't pick up a
sudden memory spike like that, and even if it did, it would not
correctly record the peak memory because the job was terminated before
that point.
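
To put numbers on the gap in the output quoted below, here is a minimal
Python sketch that converts the sacct memory fields to bytes and
compares MaxRSS with ReqMem.  The helper name sacct_mem_to_bytes is
made up for illustration, and the sketch assumes the K/M/G/T suffixes
are 1024-based and that a trailing "n" or "c" on ReqMem only marks
per-node or per-CPU memory.

```python
import re

# Assumed binary multipliers for the unit suffixes that sacct prints.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def sacct_mem_to_bytes(value: str) -> int:
    """Convert a sacct memory field such as '1617705K' or '128Gn' to bytes.

    Hypothetical helper: a trailing 'n' (per node) or 'c' (per CPU) is
    ignored, and a value without a unit suffix is treated as plain bytes.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)[nc]?", value.strip())
    if match is None:
        raise ValueError(f"unrecognised memory value: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS.get(unit, 1))


# Values taken from the 83387.batch line of the quoted sacct output.
req_mem = sacct_mem_to_bytes("128Gn")
max_rss = sacct_mem_to_bytes("1617705K")
fraction = max_rss / req_mem

print(f"MaxRSS is {fraction:.1%} of ReqMem")
```

The recorded MaxRSS comes out at a small fraction of the 128G request,
which fits the picture above: the last sample was taken well before the
spike that triggered the OOM kill.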


-Paul Edmon-


On 3/15/2021 1:52 PM, Chin,David wrote:
> Hi, all:
>
> I'm trying to understand why a job exited with an error condition. I 
> think it was actually terminated by Slurm: the job was a MATLAB script, 
> and its output was incomplete.
>
> Here's sacct output:
>
> JobID         JobName     User  Partition  NodeList  Elapsed   State       ExitCode  ReqMem  MaxRSS    MaxVMSize  AllocTRES                 AllocGRE
> ------------  ----------  ----  ---------  --------  --------  ----------  --------  ------  --------  ---------  ------------------------  --------
> 83387         ProdEmisI+  foob  def        node001   03:34:26  OUT_OF_ME+  0:125     128Gn                        billing=16,cpu=16,node=1
> 83387.batch   batch                        node001   03:34:26  OUT_OF_ME+  0:125     128Gn   1617705K  7880672K   cpu=16,mem=0,node=1
> 83387.extern  extern                       node001   03:34:26  COMPLETED   0:0       128Gn   460K      153196K    billing=16,cpu=16,node=1
>
> Thanks in advance,
>     Dave
>
> --
> David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
> dwc62 at drexel.edu  215.571.4335 (o)
> For URCF support: urcf-support at drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
>
> Drexel Internal Data
>