[slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

John DeSantis desantis at usf.edu
Wed Feb 14 07:50:00 MST 2018


Geert,

Considering the following response from Loris:

> Maybe once in a while a simulation really does just use more memory
> than you were expecting.  Have a look at the output of
> 
>   sacct -j 123456 -o jobid,maxrss,state --units=M
> 
> with the appropriate job ID.

This can certainly happen!  

I'd suggest profiling the job(s) in question; perhaps a loop of `ps`
with the appropriate output modifiers, e.g. 'rss' (and 'vsz' if you're
tracking virtual memory usage).  See the sketch below.
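A minimal sketch of such a loop, assuming the program is launched in
the background on the same node as the job script; the program name,
sample interval, and log file name are placeholders, not anything
specific to your setup:

  #!/bin/bash
  # Launch the program in the background and remember its PID.
  ./my_program.out "$SLURM_ARRAY_TASK_ID" &
  pid=$!

  # Sample RSS and VSZ (both reported by ps in KiB) until the program exits.
  while kill -0 "$pid" 2>/dev/null; do
      ps -o rss=,vsz= -p "$pid" >> "mem_${SLURM_ARRAY_TASK_ID:-0}.log"
      sleep 5
  done
  wait "$pid"

Plotting or eyeballing the resulting log will show whether the memory
usage really does spike above the requested limit at some point.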

We've seen jobs that terminate after several hours of run time because
their memory usage spiked during a JobAcctGatherFrequency sampling
interval (every 30 seconds, adjustable within slurm.conf).
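For reference, the sampling interval lives in slurm.conf, along the
lines of the following (the plugin choice and the 30-second value here
are only illustrative, not a recommendation for your site):

  # slurm.conf (illustrative values)
  JobAcctGatherType=jobacct_gather/cgroup
  JobAcctGatherFrequency=task=30   # sample task memory every 30 seconds

A peak that falls between samples may never show up in sacct's MaxRSS,
while a spike that lands on a sample can push the step over its limit.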

John DeSantis


On Wed, 14 Feb 2018 13:05:41 +0100
Loris Bennett <loris.bennett at fu-berlin.de> wrote:

> Geert Kapteijns <ghkapteijns at gmail.com> writes:
> 
> > Hi everyone,
> >
> > I’m running into out-of-memory errors when I specify an array job.
> > Needless to say, 100M should be more than enough, and increasing
> > the allocated memory to 1G doesn't solve the problem. I call my
> > script as follows: sbatch --array=100-199 run_batch_job.
> > run_batch_job contains
> >
> > #!/bin/env bash
> > #SBATCH --partition=lln
> > #SBATCH --output=/home/user/outs/%x.out.%a
> > #SBATCH --error=/home/user/outs/%x.err.%a
> > #SBATCH --cpus-per-task=1
> > #SBATCH --mem-per-cpu=100M
> > #SBATCH --time=2-00:00:00
> >
> > srun my_program.out $SLURM_ARRAY_TASK_ID
> >
> > Instead of using --mem-per-cpu and --cpus-per-task, I’ve also tried
> > the following:
> >
> > #SBATCH --mem=100M
> > #SBATCH --ntasks=1  # Number of cores
> > #SBATCH --nodes=1  # All cores on one machine
> >
> > But in both cases for some of the runs, I get the error:
> >
> > slurmstepd: error: Exceeded job memory limit at some point.
> > srun: error: obelix-cn002: task 0: Out Of Memory
> > slurmstepd: error: Exceeded job memory limit at some point.
> >
> > I’ve also posted the question on stackoverflow. Does anyone know
> > what is happening here?
> 
> Maybe once in a while a simulation really does just use more memory
> than you were expecting.  Have a look at the output of
> 
>   sacct -j 123456 -o jobid,maxrss,state --units=M
> 
> with the appropriate job ID.
> 
> Regards
> 
> Loris
> 
