[slurm-users] [EXTERNAL] Re: Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

Chad DeWitt ccdewitt at uncc.edu
Mon Mar 15 19:08:36 UTC 2021


Hi Dave,

Hope you're doing well.

(It's very possible you have already done these things.)

The logs on the compute node (the system log and slurmd.log) may yield more
info.
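
A minimal sketch of that log check, in Python. The match patterns are
assumptions: the exact wording of out-of-memory messages varies by kernel and
Slurm version, and the log locations are site-specific, so the path is left
as a parameter. The names find_oom_lines and scan_file are made up for
illustration.

```python
import re

# Assumed, case-insensitive markers of an OOM event. Real messages differ
# between kernel and Slurm versions, so adjust this to what your logs contain.
OOM_PATTERN = re.compile(r"oom[-_ ]kill|out of memory", re.IGNORECASE)


def find_oom_lines(lines, job_id=None):
    """Return the lines that mention an OOM event.

    If job_id is given, keep only the lines that also contain that job ID.
    """
    hits = []
    for line in lines:
        if not OOM_PATTERN.search(line):
            continue
        if job_id is not None and str(job_id) not in line:
            continue
        hits.append(line.rstrip("\n"))
    return hits


def scan_file(path, job_id=None):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as fh:
        return find_oom_lines(fh, job_id)
```

Kernel messages normally do not carry a Slurm job ID, so job_id is best left
unset when scanning the system log and set (e.g. to 83387) only for
slurmd.log.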

Rolling the dice, it may also be worth checking that compute node for
runaway processes or leftover jobs, and confirming that the node is healthy
(no hardware issues, etc.).
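
One way to look for runaway processes, sketched in Python under some
assumptions: a Linux node with a procps-style ps, and a CPU threshold of 50%
that is arbitrary. The function names are made up for illustration.

```python
import subprocess


def busy_processes(ps_output, name="slurmstepd", min_cpu=50.0):
    """Parse `ps -eo pid,pcpu,etime,args` output.

    Returns (pid, pcpu, etime, args) tuples for processes whose command line
    contains `name` and whose %CPU is at least `min_cpu`.
    """
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, pcpu, etime, args = parts
        try:
            entry = (int(pid), float(pcpu), etime, args)
        except ValueError:
            continue  # header line or anything else that is not a process row
        if name in args and entry[1] >= min_cpu:
            found.append(entry)
    return found


def scan_node(name="slurmstepd", min_cpu=50.0):
    """Run ps on the local node and filter its output."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,etime,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return busy_processes(out, name, min_cpu)

# Example, run on the compute node itself:
#     for pid, pcpu, etime, args in scan_node():
#         print(pid, pcpu, etime, args)
```

Note that ps reports %CPU averaged over the lifetime of the process, so a
process that only recently started spinning may show a lower figure here than
it does in top.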

Cheers,
Chad

------------------------------------------------------------

Chad DeWitt, CISSP | University Research Computing

UNC Charlotte | Office of OneIT

ccdewitt at uncc.edu | https://oneit.uncc.edu

------------------------------------------------------------




On Mon, Mar 15, 2021 at 2:50 PM Chin,David <dwc62 at drexel.edu> wrote:

> One possible datapoint: on the node where the job ran, there were two
> slurmstepd processes running, both at 100%CPU even after the job had ended.
>
>
> --
> David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
> dwc62 at drexel.edu                     215.571.4335 (o)
> For URCF support: urcf-support at drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Chin,David <dwc62 at drexel.edu>
> *Sent:* Monday, March 15, 2021 13:52
> *To:* Slurm-Users List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS
> and MaxVMSize are under the ReqMem value
>
>
> Hi, all:
>
> I'm trying to understand why a job exited with an error condition. I think
> it was actually terminated by Slurm: the job was a MATLAB script, and its
> output was incomplete.
>
> Here's sacct output:
>
> JobID         JobName     User  Partition  NodeList  Elapsed   State       ExitCode  ReqMem  MaxRSS    MaxVMSize  AllocTRES                 AllocGRE
> ------------  ----------  ----  ---------  --------  --------  ----------  --------  ------  --------  ---------  ------------------------  --------
> 83387         ProdEmisI+  foob  def        node001   03:34:26  OUT_OF_ME+  0:125     128Gn                        billing=16,cpu=16,node=1
> 83387.batch   batch                        node001   03:34:26  OUT_OF_ME+  0:125     128Gn   1617705K  7880672K   cpu=16,mem=0,node=1
> 83387.extern  extern                       node001   03:34:26  COMPLETED   0:0       128Gn   460K      153196K    billing=16,cpu=16,node=1
>
> Thanks in advance,
>     Dave
>