[slurm-users] how to configure correctly node and memory when a script fails with out of memory
Rémi Palancher
remi at rackslab.io
Wed Nov 1 09:19:06 UTC 2023
Hello Gérard,
> On 30/10/2023 15:46, Gérard Henry (AMU) wrote:
>> Hello all,
>> …
>> when it fails, sacct gives the following information:
>> JobID         JobName    Elapsed   NCPUS  TotalCPU   CPUTime   ReqMem  MaxRSS     MaxDiskRead  MaxDiskWrite  State       ExitCode
>> ------------  ---------  --------  -----  ---------  --------  ------  ---------  -----------  ------------  ----------  --------
>> 8500578       analyse5   00:03:04     60  02:57:58   03:04:00  90000M                                        OUT_OF_ME+  0:125
>> 8500578.bat+  batch      00:03:04     16  46:34.302  00:49:04          21465736K  0.23M        0.01M         OUT_OF_ME+  0:125
>> 8500578.0     orted      00:03:05     44  02:11:24   02:15:40          40952K     0.42M        0.03M         COMPLETED   0:0
>>
>> I don't understand why MaxRSS=21G leads to "out of memory" with 16 CPUs
>> and 1500M per CPU (a 24G limit)
Due to the job accounting sampling interval, tasks whose memory consumption
increases quickly may not be accurately reported by `sacct`. The default
JobAcctGatherFrequency is 30 seconds, so your batch step may have reached
its limit within the 30-second window following the 21GB measurement.
You can probably retrieve the exact memory consumption from the nodes'
kernel logs at the time the tasks were killed.
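In case it helps, a minimal sketch of the checks I have in mind (assuming
you have shell access to the compute node that ran the batch step; log
locations and tooling vary by distribution):

  # current accounting sampling interval (cluster-wide setting)
  scontrol show config | grep -i JobAcctGatherFrequency

  # on the compute node, the OOM killer record in the kernel log reports
  # the actual RSS of the task at the moment it was killed
  dmesg -T | grep -i -E 'out of memory|oom-kill'
  # or, on systemd-based systems:
  journalctl -k | grep -i oom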
On 30/10/2023 at 15:53, Gérard Henry wrote:
> if I try to request just nodes and memory, for instance:
> #SBATCH -N 2
> #SBATCH --mem=0
> to request all the memory on a node, and 2 nodes seem sufficient for a
> program that consumes 100GB, I got this error:
> sbatch: error: CPU count per node can not be satisfied
> sbatch: error: Batch job submission failed: Requested node configuration
> is not available
Do you have a MaxMemPerCPU set on the cluster or on the partition? If this
value is too low, Slurm increases the job's CPU count to keep the memory
per CPU within the limit, and the submission can then fail with the CPU
count error above.
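To check whether such a limit is set, something along these lines should
work (field names can vary slightly between Slurm versions):

  # cluster-wide limit, if any
  scontrol show config | grep -i MaxMemPer

  # per-partition limits, if any
  scontrol show partition | grep -i -E 'MaxMemPer|DefMemPer'

If a MaxMemPerCPU is in place, an alternative to --mem=0 is to request
memory per CPU explicitly, e.g. --mem-per-cpu=1500M together with the
task/CPU counts your program actually needs (the 1500M figure here is
just taken from your first job, adjust it to your case).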
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/