[slurm-users] how to configure correctly node and memory when a script fails with out of memory

Rémi Palancher remi at rackslab.io
Wed Nov 1 09:19:06 UTC 2023


Hello Gérard,

> On 30/10/2023 15:46, Gérard Henry (AMU) wrote:
>> Hello all,
>> when it fails, sacct gives the following information:
>> JobID           JobName    Elapsed      NCPUS   TotalCPU    CPUTime     ReqMem     MaxRSS  MaxDiskRead MaxDiskWrite      State ExitCode
>> ------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ------------ ------------ ---------- --------
>> 8500578        analyse5   00:03:04         60   02:57:58   03:04:00     90000M                                       OUT_OF_ME+    0:125
>> 8500578.bat+      batch   00:03:04         16  46:34.302   00:49:04             21465736K        0.23M        0.01M  OUT_OF_ME+    0:125
>> 8500578.0         orted   00:03:05         44   02:11:24   02:15:40                40952K        0.42M        0.03M   COMPLETED      0:0
>>
>> I don't understand why MaxRSS=21465736K (~21GB) leads to "out of memory"
>> with 16 CPUs and 1500M per CPU (24GB total)

Due to the job accounting sampling interval, tasks whose memory consumption 
increases quickly may not be accurately reported by `sacct`. The default 
JobAcctGatherFrequency is 30 seconds, so your batch step may have exceeded 
its limit within the 30-second window following the 21GB sample.
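As a quick check, the effective sampling interval can be read from the running configuration (a sketch; run `scontrol` on any node that can reach the controller):

```shell
# Show the accounting sampling interval (seconds); a larger value means
# fast-growing memory usage is more likely to be under-reported by sacct.
scontrol show config | grep -i JobAcctGatherFrequency
```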

You can probably retrieve the exact memory consumption from the nodes' 
kernel logs at the time the tasks were killed.
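For example, assuming cgroup-based memory enforcement (so the kernel OOM killer terminates the task) and a systemd-based node, something like the following on the compute node(s) reported by `sacct -j <jobid> -o NodeList` should show the kill and the actual memory usage at that moment:

```shell
# Search the kernel ring buffer / journal for the OOM killer report
# around the time the job step died; adjust the date to your job's runtime.
journalctl -k --since "2023-10-30 15:40" | grep -i -e oom -e 'killed process'

# Or, on nodes without systemd-journald:
dmesg -T | grep -i -e oom -e 'killed process'
```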

On 30/10/2023 15:53, Gérard Henry wrote:
 > if I try to request just nodes and memory, for instance:
 > #SBATCH -N 2
 > #SBATCH --mem=0
 > to request all memory on a node, and 2 nodes seem sufficient for a
 > program that consumes 100GB, I got this error:
 > sbatch: error: CPU count per node can not be satisfied
 > sbatch: error: Batch job submission failed: Requested node configuration
 > is not available

Do you have a MaxMemPerCPU limit set on the cluster or on the partition? 
When it is set, Slurm converts a memory request into a minimum CPU count 
(requested memory divided by MaxMemPerCPU). With `--mem=0` (all memory on 
the node), that implied CPU count can exceed the CPUs actually available 
on a node, which produces exactly the "CPU count per node can not be 
satisfied" error.
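To check whether such a limit is in place (a sketch; run from any login node):

```shell
# Cluster-wide memory-per-CPU cap, if any:
scontrol show config | grep -i MaxMemPerCPU

# Per-partition caps (print the partition name alongside any limit):
scontrol show partition | grep -i -e PartitionName -e MaxMemPerCPU
```

If a cap is set, requesting memory per CPU at or below it together with an explicit CPU count (e.g. `--mem-per-cpu=` plus `--ntasks`/`--cpus-per-task`) avoids the implicit CPU-count scaling that `--mem=0` triggers.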

-- 
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/



