[slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

Sean Crosby scrosby at unimelb.edu.au
Mon Mar 15 19:22:08 UTC 2021


What are your Slurm settings? Specifically, what are the values of

ProctrackType
JobAcctGatherType
JobAcctGatherParams

and what are the contents of cgroup.conf? Also, what version of Slurm are you
using?
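
In case it is useful, here is a rough sketch of one way to collect all of
that in one go. It assumes the Slurm client tools are on your PATH and that
cgroup.conf lives in /etc/slurm (a common default, but your site may differ).
The helper name slurm_settings is just something made up for this example.

```shell
# Hypothetical helper: keep only the settings of interest from
# `scontrol show config` output supplied on stdin.
slurm_settings() {
    grep -E '^(ProctrackType|JobAcctGatherType|JobAcctGatherParams|SLURM_VERSION)[[:space:]]*='
}

# Only query the cluster when the Slurm client tools are available.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | slurm_settings
    # Assumption: cgroup.conf sits in /etc/slurm; adjust the path for your site.
    if [ -r /etc/slurm/cgroup.conf ]; then
        cat /etc/slurm/cgroup.conf
    fi
fi
```

The SLURM_VERSION line in the scontrol output covers the version question, so
a single paste of the result should answer everything above.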

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Tue, 16 Mar 2021 at 04:52, Chin,David <dwc62 at drexel.edu> wrote:

> Hi, all:
>
> I'm trying to understand why a job exited with an error condition. I think
> it was actually terminated by Slurm: the job was a MATLAB script, and its
> output was incomplete.
>
> Here's sacct output:
>
>        JobID    JobName User Partition NodeList  Elapsed      State ExitCode ReqMem   MaxRSS MaxVMSize                AllocTRES AllocGRE
> ------------ ---------- ---- --------- -------- -------- ---------- -------- ------ -------- --------- ------------------------ --------
>        83387 ProdEmisI+ foob       def  node001 03:34:26 OUT_OF_ME+    0:125  128Gn                    billing=16,cpu=16,node=1
>  83387.batch      batch                 node001 03:34:26 OUT_OF_ME+    0:125  128Gn 1617705K  7880672K      cpu=16,mem=0,node=1
> 83387.extern     extern                 node001 03:34:26  COMPLETED      0:0  128Gn     460K   153196K billing=16,cpu=16,node=1
>
> Thanks in advance,
>     Dave
>
> --
> David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
> dwc62 at drexel.edu                     215.571.4335 (o)
> For URCF support: urcf-support at drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode