<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>It sounds like you're confusing job steps and tasks. For an MPI
program, tasks and MPI ranks are the same thing. A Slurm job can have
multiple steps. A single job step could have only 1 task, while
another step in the same job can use 1,000 tasks. When looking at
the amount of memory for a job, the important number is the
largest value of MaxRSS across all the job steps. Why is this important?
Because if you don't request at least this much with your --mem
specification, your job may fail. <br>
</p>
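<p>As a rough illustration of that comparison, here is a minimal
Python sketch. It assumes the per-step MaxRSS values are already
available as integers in KiB, and that the --mem request is known in
MiB. The function names, the 32768 MiB request, and the 10 percent
headroom are made up for the example and are not part of Slurm; the
MaxRSS values are the ones from the sacct output later in this
message. <br>
</p>
<pre>
```python
# Sketch only: compare the largest per-step MaxRSS with a --mem request.
# Assumptions: MaxRSS values are integers in KiB, the request is in MiB.


def memory_efficiency(step_max_rss_kib, requested_mib):
    """Fraction of the requested memory covered by the largest MaxRSS."""
    peak_kib = max(step_max_rss_kib)
    return peak_kib / (requested_mib * 1024)


def suggested_mem_mib(step_max_rss_kib, headroom_percent=10):
    """Largest MaxRSS plus some headroom, rounded up to whole MiB."""
    peak_kib = max(step_max_rss_kib)
    scaled = peak_kib * (100 + headroom_percent)
    divisor = 100 * 1024
    # Integer ceiling division avoids floating-point rounding surprises.
    return (scaled + divisor - 1) // divisor


# MaxRSS per step in KiB (batch, extern, step 0).
steps_kib = [20999632, 1060, 24014384]
requested_mib = 32768  # hypothetical --mem=32G request

print("efficiency:", round(memory_efficiency(steps_kib, requested_mib), 3))
print("suggested --mem (MiB):", suggested_mem_mib(steps_kib))
```
</pre>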
<p>Based on your definition of AveRSS (I didn't go back and check
the documentation myself), it sounds like you're doing unnecessary
math, since I'm sure Slurm sums up the maximum RSS values of the
individual tasks to get MaxRSS, and then divides that by the
number of tasks to get AveRSS. <br>
</p>
<p>What you want is the MaxRSS for the job step with the largest
value of MaxRSS. For example, here's a parallel job I ran earlier
today: <br>
</p>
<pre>$ sacct -u pbisbal -o jobid,jobname,MaxRSS,AveRSS
       JobID    JobName     MaxRSS     AveRSS
------------ ---------- ---------- ----------
1100800       mcnp_test
1100800.bat+      batch  20999632K  20999632K
1100800.ext+     extern      1060K       964K
1100800.0         orted  24014384K 9238477482
</pre>
<p>The real "memory" for this entire job would be 24014384K, the MaxRSS of step 1100800.0.<br>
</p>
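<p>If you want to script that, something like the following Python
sketch would pick out the step with the largest MaxRSS. It assumes
you already have (JobID, MaxRSS) pairs as strings, that a trailing K,
M, G or T is a power-of-1024 multiplier, and that a bare number is a
byte count. The helper names rss_to_bytes and peak_step are just
illustrative. <br>
</p>
<pre>
```python
# Sketch: pick the job step with the largest MaxRSS from sacct-style rows.
# Assumption: a trailing K, M, G or T is a power-of-1024 multiplier and a
# bare number is a byte count.

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def rss_to_bytes(value):
    """Convert a string such as '24014384K' to bytes; empty fields count as 0."""
    value = value.strip()
    if not value:
        return 0
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(float(value))


def peak_step(rows):
    """Return (step_id, bytes) for the step with the largest MaxRSS."""
    return max(((step, rss_to_bytes(rss)) for step, rss in rows),
               key=lambda pair: pair[1])


# (JobID, MaxRSS) pairs copied from the sacct output above.
rows = [
    ("1100800", ""),
    ("1100800.bat+", "20999632K"),
    ("1100800.ext+", "1060K"),
    ("1100800.0", "24014384K"),
]

step, peak = peak_step(rows)
print(step, peak // 1024, "K")
```
</pre>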
<pre class="moz-signature" cols="72">Prentice</pre>
<div class="moz-cite-prefix">On 3/9/21 3:41 AM, <a class="moz-txt-link-abbreviated" href="mailto:xiaojinghu93@163.com">xiaojinghu93@163.com</a>
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:1CB5E094-9EC2-4998-B834-5D9C9321D9CA@163.com">
<pre class="moz-quote-pre" wrap="">Hi guys,
I would like to calculate the CPU efficiency and Memory efficiency of slurm jobs.
I am having difficulty calculating the real “memory” a job uses.
According to Slurm, “maxRSS” means “Maximum resident set size of all tasks in job”. If so, how can I get the memory used by a single job? As far as I understand, if I need to know the memory used by a single job/jobstep, I need to sum up the memory used by each task. So I think I should use the “aveRSS” field, which gives the “average resident set size of all tasks in job”. If I multiply “aveRSS” by the number of tasks, I should get the real memory a job/jobstep used.
But I studied the code of the “seff” command and it claims to be equivalent to “sacct -P -n -a --format JobID,User,Group,State,Cluster,AllocCPUS,REQMEM,TotalCPU,Elapsed,MaxRSS,ExitCode,NNodes,NTasks -j &lt;job_id&gt;”, which means I should use “maxRSS”.
Can anyone give me some explanation on that?
Very grateful for any help.
Thank you!
Regards,
Xiaojing
</pre>
</blockquote>
</body>
</html>