Hi All,
I am having trouble determining the real RSS memory usage of certain
users' jobs, because sacct returns numbers that look wrong.
Rocky Linux release 8.5, Slurm 21.08
(slurm.conf)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
The problematic jobs look like this:
1. Python spawns 96 threads via multithreading;
2. each thread calls scikit-learn (sklearn), which in turn spawns 96
OpenMP threads.
This obviously oversubscribes the node, and I want to address it (a
rough sketch of such a job is included below).
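
For reference, a minimal sketch of what I believe the job pattern looks
like; the estimator, data sizes, and thread counts here are my
assumptions, not the users' actual code:

from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.cluster import KMeans

def fit_one(seed):
    # Each outer thread runs its own fit; scikit-learn's KMeans
    # parallelises internally with OpenMP and uses as many threads as
    # cores unless OMP_NUM_THREADS is set, so the effective thread
    # count becomes roughly 96 x 96.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(10_000, 20))
    return KMeans(n_clusters=8, n_init=10).fit(X).inertia_

with ThreadPoolExecutor(max_workers=96) as pool:
    results = list(pool.map(fit_one, range(96)))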
The node has 300 GB of RAM, but sacct (and seff) report 1.2 TB MaxRSS
(and AveRSS), which cannot be correct.
I suspect that Slurm with jobacct_gather/linux sums the RSS reported
for each of these threads, counting the same shared memory many times.
Perhaps the OpenMP part is handled fine by Slurm, while Python
multithreading does not play well with Slurm's memory accounting?
So, if this is the case, would the real usage be roughly
1.2 TB / 96 ≈ 12.5 GB MaxRSS?
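
My reasoning: every thread of a process shares the same address space,
so /proc/<pid>/task/<tid>/status reports the full process VmRSS for
each thread, and a naive sum over threads multiplies the real figure by
the thread count. A quick check I could run on the node (the PID below
is only a placeholder for one of the users' python processes):

import os

def vmrss_kb(status_path):
    # Parse the "VmRSS:" line (value in kB) from a /proc status file.
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except FileNotFoundError:
        pass
    return 0

pid = 12345  # placeholder PID of one of the python workers
proc_rss = vmrss_kb(f"/proc/{pid}/status")
tids = os.listdir(f"/proc/{pid}/task")
thread_sum = sum(vmrss_kb(f"/proc/{pid}/task/{t}/status") for t in tids)
print(f"process VmRSS: {proc_rss} kB")
print(f"sum over {len(tids)} thread entries: {thread_sum} kB")
# Every thread reports the whole process's RSS, so the naive sum is
# roughly thread_count times larger than the real usage.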
I want to get the right MaxRSS to report to users.
Thanks!
Best,
Feng