[slurm-users] how to configure correctly node and memory when a script fails with out of memory

Gérard Henry (AMU) gerard.henry at univ-amu.fr
Mon Oct 30 14:46:24 UTC 2023


Hello all,


I can't get my Slurm script configured correctly. My program needs 100GB of 
memory; that is the only requirement. But the job always fails with an 
out-of-memory error.
Here's the cluster configuration I'm using:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

partition:
DefMemPerCPU=5770 MaxMemPerCPU=5778
TRES=cpu=5056,mem=30020000M,node=158
for each node: CPUAlloc=32 RealMemory=190000 AllocMem=184640

my script contains:
#SBATCH -N 5
#SBATCH --ntasks=60
#SBATCH --mem-per-cpu=1500M
#SBATCH --cpus-per-task=1
...
mpirun ../zsimpletest_analyse
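
For reference, this is how the total memory request works out from those 
directives (a sketch; values copied from the script above):

```shell
# Total memory requested: ntasks * cpus-per-task * mem-per-cpu
ntasks=60
cpus_per_task=1
mem_per_cpu_mb=1500
total_mb=$(( ntasks * cpus_per_task * mem_per_cpu_mb ))
echo "${total_mb}M"   # 90000M, which matches ReqMem in the sacct output below
```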

when it fails, sacct gives the following information:

JobID        JobName  Elapsed   NCPUS  TotalCPU   CPUTime   ReqMem  MaxRSS     MaxDiskRead  MaxDiskWrite  State       ExitCode
------------ -------- --------  -----  ---------  --------  ------  ---------  -----------  ------------  ----------  --------
8500578      analyse5 00:03:04     60  02:57:58   03:04:00  90000M                                        OUT_OF_ME+  0:125
8500578.bat+ batch    00:03:04     16  46:34.302  00:49:04          21465736K  0.23M        0.01M         OUT_OF_ME+  0:125
8500578.0    orted    00:03:05     44  02:11:24   02:15:40          40952K     0.42M        0.03M         COMPLETED   0:0

I don't understand why MaxRSS=21465736K (about 21GB) for the batch step 
leads to "out of memory" when that step has 16 CPUs at 1500M per CPU, 
i.e. a limit of about 24GB.
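
The comparison spelled out, using the figures from the sacct output above 
(a sketch; the 16-CPU batch step is the one that hit OUT_OF_MEMORY):

```shell
# Batch step (8500578.bat+): reported peak RSS vs. its memory limit
maxrss_kb=21465736                      # MaxRSS from sacct, in KB
ncpus=16                                # CPUs allocated to the batch step
mem_per_cpu_mb=1500                     # --mem-per-cpu from the script
limit_mb=$(( ncpus * mem_per_cpu_mb ))  # 24000 MB, roughly 24 GB
maxrss_mb=$(( maxrss_kb / 1024 ))       # 20962 MB, roughly 21 GB
echo "peak ${maxrss_mb} MB of ${limit_mb} MB allowed"
```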

Can anybody help?

Thanks in advance

-- 
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue 
Escadrille Normandie Niemen, 13013 Marseille
Site : https://fresnel.fr/
To protect the environment, please print this email only if necessary.
