<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">
<p>Dear Daryl,</p>
<p><br>
</p>
<p>I once posed the same question and got a clear answer here on the list a while ago, so I will simply pass it along, approximately.</p>
<p><br>
</p>
<p>RSS appears to double count memory occupied by shared libraries: each process maps the library, and each mapping is counted in full. It was suggested that I switch to PSS instead:</p>
<p><a href="https://slurm.schedmd.com/slurm.conf.html#OPT_JobAcctGatherParams" class="OWAAutoLink">https://slurm.schedmd.com/slurm.conf.html#OPT_JobAcctGatherParams</a></p>
<p><a href="https://www.sobyte.net/post/2022-04/pss-uss-rss/" class="OWAAutoLink">https://www.sobyte.net/post/2022-04/pss-uss-rss/</a></p>
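<p>For reference, switching the accounting to PSS is a slurm.conf setting. This is only a sketch from the documentation linked above; verify the exact spelling against your Slurm version, and note that UsePss applies only to the jobacct_gather/linux plugin:</p>
<pre># slurm.conf (excerpt, assumed setup)
# Gather accounting data from /proc on Linux
JobAcctGatherType=jobacct_gather/linux
# Report PSS (proportional set size) instead of RSS,
# so shared-library pages are divided among the processes sharing them
JobAcctGatherParams=UsePss</pre>
<p>You can see the difference yourself by comparing the Rss and Pss lines in /proc/&lt;pid&gt;/smaps_rollup for one of the job's processes.</p>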
<p><br>
</p>
<p>I hope this helps (and that I understood the answer to my own question correctly ;) ).<br>
</p>
<p><br>
</p>
<p>Kind regards.</p>
<p>Martin<br>
</p>
<p><br>
</p>
<br>
<br>
<div style="color: rgb(0, 0, 0);">
<div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> slurm-users <slurm-users-bounces@lists.schedmd.com> on behalf of Daryl Roche <droche@bcchr.ca><br>
<b>Sent:</b> Friday, December 16, 2022 01:43<br>
<b>To:</b> 'slurm-users@lists.schedmd.com'<br>
<b>Subject:</b> [slurm-users] seff MaxRSS Above 100 Percent?</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:10pt;">
<div class="PlainText">Hey All, <br>
<br>
I was just hoping to find out if anyone can explain how a job running on a single node was able to have a MaxRSS of 240% as reported by seff. Below are some specifics about the job that was run. We're using Slurm 19.05.7 on CentOS 8.2.<br>
[root@hpc-node01 ~]# scontrol show jobid -dd 97036<br>
JobId=97036 JobName=jobbie.sh<br>
UserId=username(012344321) GroupId=domain users(214400513) MCS_label=N/A<br>
Priority=4294842062 Nice=0 Account=(null) QOS=normal<br>
JobState=COMPLETED Reason=None Dependency=(null)<br>
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0<br>
DerivedExitCode=0:0<br>
RunTime=00:22:35 TimeLimit=8-08:00:00 TimeMin=N/A<br>
SubmitTime=2022-12-02T14:27:47 EligibleTime=2022-12-02T14:27:48<br>
AccrueTime=2022-12-02T14:27:48<br>
StartTime=2022-12-02T14:27:48 EndTime=2022-12-02T14:50:23 Deadline=N/A<br>
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-12-02T14:27:48<br>
Partition=defq AllocNode:Sid=hpc-node01:3921213<br>
ReqNodeList=(null) ExcNodeList=(null)<br>
NodeList=hpc-node11<br>
BatchHost=hpc-node11<br>
NumNodes=1 NumCPUs=40 NumTasks=0 CPUs/Task=40 ReqB:S:C:T=0:0:*:*<br>
TRES=cpu=40,mem=350G,node=1,billing=40<br>
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*<br>
Nodes=hpc-node11 CPU_IDs=0-79 Mem=358400 GRES=<br>
MinCPUsNode=40 MinMemoryNode=350G MinTmpDiskNode=0<br>
Features=(null) DelayBoot=00:00:00<br>
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)<br>
Command=/path/to/scratch/jobbie.sh<br>
WorkDir=/path/to/scratch<br>
StdErr=/path/to/scratch/jobbie.sh-97036.error<br>
StdIn=/dev/null<br>
StdOut=/path/to/scratch/jobbie.sh-97036.out<br>
Power=<br>
<br>
[root@hpc-node01 ~]# seff 97036<br>
Job ID: 97036<br>
Cluster: slurm<br>
Use of uninitialized value $user in concatenation (.) or string at /cm/shared/apps/slurm/current/bin/seff line 154, <DATA> line 604.<br>
User/Group: /domain users<br>
State: COMPLETED (exit code 0)<br>
Nodes: 1<br>
Cores per node: 40<br>
CPU Utilized: 04:43:36<br>
CPU Efficiency: 31.39% of 15:03:20 core-walltime<br>
Job Wall-clock time: 00:22:35<br>
Memory Utilized: 840.21 GB<br>
Memory Efficiency: 240.06% of 350.00 GB<br>
<br>
[root@hpc-node11 ~]# free -m<br>
total used free shared buff/cache available<br>
Mem: 385587 2761 268544 7891 114281 371865<br>
Swap: 0 0 0<br>
<br>
<br>
[root@hpc-node11 ~]# cat /path/to/scratch/jobbie.sh<br>
#!/bin/bash<br>
<br>
#SBATCH --mail-user=username@bcchr.ca<br>
#SBATCH --mail-type=ALL<br>
<br>
## CPU Usage<br>
#SBATCH --mem=350G<br>
#SBATCH --cpus-per-task=40<br>
#SBATCH --time=200:00:00<br>
#SBATCH --nodes=1<br>
<br>
## Output and Stderr<br>
#SBATCH --output=%x-%j.out<br>
#SBATCH --error=%x-%j.error<br>
<br>
<br>
source /path/to/tools/Miniconda3/opt/miniconda3/etc/profile.d/conda.sh<br>
conda activate nanomethphase<br>
<br>
<br>
# Working dir<br>
Working_Dir=/path/to/scratch/<br>
<br>
# Case methyl sample<br>
Case=$Working_Dir/Case/methylation_frequency.tsv<br>
<br>
# Control, here using a dir with both parents<br>
Control=$Working_Dir/Control/<br>
<br>
# DMA call<br>
/path/to/tools/NanoMethPhase/NanoMethPhase/nanomethphase.py dma \<br>
--case $Case \<br>
--control $Control \<br>
--columns 1,2,5,7 \<br>
--out_prefix DH0808_Proband_vs_Controls_DMA \<br>
--out_dir $Working_Dir<br>
<br>
<br>
<br>
<br>
Daryl Roche<br>
<br>
</div>
</span></font></div>
</div>
</body>
</html>