[slurm-users] [EXTERNAL] Re: Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
Chad DeWitt
ccdewitt at uncc.edu
Mon Mar 15 19:08:36 UTC 2021
Hi Dave,
Hope you're doing well.
(...very possible you have already done these things...)
Maybe the logs on the compute node (system and slurmd.log) would yield more
info?
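As a rough sketch of that log check, the Python below filters log lines for wording that usually accompanies an OOM kill. The phrases in the pattern are an assumption (kernel and cgroup messages vary by version), and `find_oom_lines` is a hypothetical helper, not part of Slurm.

```python
import re

# Assumption: these phrases are typical of Linux OOM-killer / cgroup
# memory-limit messages; the exact wording differs between versions.
OOM_PATTERN = re.compile(r"out of memory|oom[-_ ]?kill", re.IGNORECASE)


def find_oom_lines(lines):
    """Return (line_number, text) pairs for lines that mention an OOM event."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOM_PATTERN.search(line)
    ]
```

It accepts any iterable of lines, so an open file handle for slurmd.log or the system log on the node can be passed in directly; the file locations depend on the site configuration.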
It's a roll of the dice, but it may also be worth looking for runaway
processes or jobs on that compute node, and confirming that the node is
healthy (no hardware issues, etc.).
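One way to look for runaway processes, such as the slurmstepd processes mentioned in the quoted message below, is to filter `ps` output by command name and CPU usage. This is only a sketch: the 90% threshold and the helper names are assumptions, and `ps` reports CPU averaged over the process lifetime, so `top` may show different numbers.

```python
import subprocess


def busy_processes(ps_text, name="slurmstepd", min_cpu=90.0):
    """Parse `ps -eo pid,pcpu,comm` output; return (pid, cpu) for busy matches."""
    hits = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, cpu, comm = parts
        if comm.strip() == name and float(cpu) >= min_cpu:
            hits.append((int(pid), float(cpu)))
    return hits


def snapshot():
    """Run ps on the local node (assumes a procps-style ps is installed)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running `busy_processes(snapshot())` on the compute node lists any matching PIDs, which can then be compared against the jobs Slurm believes are running there.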
Cheers,
Chad
------------------------------------------------------------
Chad DeWitt, CISSP | University Research Computing
UNC Charlotte | Office of OneIT
ccdewitt at uncc.edu | https://oneit.uncc.edu
------------------------------------------------------------
On Mon, Mar 15, 2021 at 2:50 PM Chin,David <dwc62 at drexel.edu> wrote:
> One possible data point: on the node where the job ran, two slurmstepd
> processes were still running, both at 100% CPU, even after the job had ended.
>
>
> --
> David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
> dwc62 at drexel.edu 215.571.4335 (o)
> For URCF support: urcf-support at drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Chin,David <dwc62 at drexel.edu>
> *Sent:* Monday, March 15, 2021 13:52
> *To:* Slurm-Users List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS
> and MaxVMSize are under the ReqMem value
>
> Hi, all:
>
> I'm trying to understand why a job exited with an error condition. I think
> it was actually terminated by Slurm: the job was a MATLAB script, and its
> output was incomplete.
>
> Here's sacct output:
>
> JobID         JobName     User  Partition  NodeList  Elapsed   State       ExitCode  ReqMem  MaxRSS    MaxVMSize  AllocTRES                 AllocGRE
> ------------  ----------  ----  ---------  --------  --------  ----------  --------  ------  --------  ---------  ------------------------  --------
> 83387         ProdEmisI+  foob  def        node001   03:34:26  OUT_OF_ME+  0:125     128Gn                        billing=16,cpu=16,node=1
> 83387.batch   batch                        node001   03:34:26  OUT_OF_ME+  0:125     128Gn   1617705K  7880672K   cpu=16,mem=0,node=1
> 83387.extern  extern                       node001   03:34:26  COMPLETED   0:0       128Gn   460K      153196K    billing=16,cpu=16,node=1
>
> Thanks in advance,
> Dave
>
> --
> David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
> dwc62 at drexel.edu 215.571.4335 (o)
> For URCF support: urcf-support at drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
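To put the figures from the quoted sacct output on a common scale, here is a small sketch that converts the memory strings to bytes. It assumes Slurm's K/M/G suffixes are binary (powers of 1024) and that the trailing `n` in ReqMem means "per node" (`c` would mean "per CPU"); `to_bytes` is a hypothetical helper.

```python
# Assumption: sacct memory suffixes are binary multiples.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(value):
    """Convert an sacct memory string such as '1617705K' or '128Gn' to bytes."""
    value = value.rstrip("nc")  # ReqMem suffix: n = per node, c = per CPU
    if value[-1] in UNITS:
        return int(float(value[:-1]) * UNITS[value[-1]])
    return int(float(value))


requested = to_bytes("128Gn")
for field, raw in [("MaxRSS", "1617705K"), ("MaxVMSize", "7880672K")]:
    gib = to_bytes(raw) / 1024**3
    print(f"{field}: {gib:.2f} GiB of {requested / 1024**3:.0f} GiB requested")
```

Under those assumptions, MaxRSS for the batch step is about 1.5 GiB and MaxVMSize about 7.5 GiB, both far below the 128 GiB requested. Since these accounting values are gathered by periodic sampling, a short spike between samples might not show up in them.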