[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
Chin,David
dwc62 at drexel.edu
Mon Mar 15 18:48:46 UTC 2021
One possible datapoint: on the node where the job ran, there were two slurmstepd processes still running, both at 100% CPU, even after the job had ended.
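For anyone who wants to check their own nodes for the same thing, a generic ps one-liner along these lines is enough (nothing Slurm-specific about it, and the exact fields you list are a matter of taste):

    # list any slurmstepd processes on the node with their CPU usage and elapsed time;
    # the bracketed 's' keeps grep from matching its own command line
    ps -eo pid,pcpu,etime,cmd | grep '[s]lurmstepd'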
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dwc62 at drexel.edu 215.571.4335 (o)
For URCF support: urcf-support at drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Chin,David <dwc62 at drexel.edu>
Sent: Monday, March 15, 2021 13:52
To: Slurm-Users List <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
Hi, all:
I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: the job was a Matlab script, and its output was incomplete.
Here's sacct output:
JobID                JobName    User      Partition  NodeList        Elapsed    State      ExitCode ReqMem     MaxRSS     MaxVMSize  AllocTRES                        AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- -------------------------------- --------
83387                ProdEmisI+ foob      def        node001         03:34:26   OUT_OF_ME+ 0:125    128Gn                            billing=16,cpu=16,node=1
83387.batch          batch                           node001         03:34:26   OUT_OF_ME+ 0:125    128Gn      1617705K   7880672K   cpu=16,mem=0,node=1
83387.extern         extern                          node001         03:34:26   COMPLETED  0:0      128Gn      460K       153196K    billing=16,cpu=16,node=1
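For reference, an sacct invocation roughly like the one below should reproduce those columns (field names come from sacct's --format option; the exact flags I used may have differed slightly, and the last header above is presumably AllocGRES truncated to eight characters):

    # query the finished job by ID and request the fields shown above,
    # widening JobID and AllocTRES so they are not truncated
    sacct -j 83387 \
        --format=JobID%20,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocTRES%32,AllocGRES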
Thanks in advance,
Dave
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dwc62 at drexel.edu 215.571.4335 (o)
For URCF support: urcf-support at drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode