[slurm-users] Seff error with Slurm-18.08.1

Paddy Doyle paddy at tchpc.tcd.ie
Thu Nov 8 09:07:06 MST 2018


Hi all,

It looks like we can use the API to avoid having to manually parse the '2='
entry out of the stats{tres_usage_in_max} string.

I've submitted a bug report and patch:

https://bugs.schedmd.com/show_bug.cgi?id=6004

The minimal changes needed are in the attached seff.patch.
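
If you can't apply the patch right away, the string can also be parsed by
hand, as Marcus describes below. A minimal standalone sketch (untested; the
helper name is mine, and it assumes tres_usage_in_max has the form
'1=...,2=...', where TRES id 2 is memory in bytes):

    # Hypothetical fallback: pull TRES id 2 (memory, reported in bytes)
    # out of a tres_usage_in_max string and convert it to kbytes, which
    # is what seff expected from the old rss_max field.
    sub mem_kb_from_tres_string {
        my ($tres) = @_;
        return 0 unless defined $tres && length $tres;
        my ($bytes) = $tres =~ /(?:^|,)2=(\d+)/;
        return defined $bytes ? $bytes / 1024 : 0;
    }

    my $lmem = mem_kb_from_tres_string($step->{'stats'}{'tres_usage_in_max'});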

Hope that helps,

Paddy

On Thu, Nov 08, 2018 at 11:54:59AM +0100, Marcus Wagner wrote:

> Hi Miguel,
> 
> 
> This is because SchedMD changed the stats field: rss_max no longer
> exists (cf. line 225 of seff). You need to evaluate the field
> stats{tres_usage_in_max} and take the value after '2=' instead. Note that
> this is the memory value in bytes rather than kbytes, so it additionally
> needs to be divided by 1024.
> 
> 
> Best
> Marcus
> 
> On 11/08/2018 11:06 AM, Miguel A. Sánchez wrote:
> > 
> > Hi, and thanks for all your answers; sorry for the delay in replying.
> > Yesterday I installed Slurm-18.08.3 on the controller machine to check
> > whether the seff command works correctly with this latest release. The
> > behavior has improved, but I still receive an error message:
> > 
> > 
> > # /usr/local/slurm-18.08.3/bin/seff 1694112
> > Use of uninitialized value $lmem in numeric lt (<) at
> > /usr/local/slurm-18.08.3/bin/seff line 130, <DATA> line 624.
> > Job ID: 1694112
> > Cluster: XXXXX
> > User/Group: XXXXX
> > State: COMPLETED (exit code 0)
> > Nodes: 1
> > Cores per node: 2
> > CPU Utilized: 01:39:33
> > CPU Efficiency: 4266.43% of 00:02:20 core-walltime
> > Job Wall-clock time: 00:01:10
> > Memory Utilized: 0.00 MB (estimated maximum)
> > Memory Efficiency: 0.00% of 3.91 GB (3.91 GB/node)
> > [root at hydra ~]#
> > 
> > 
> > Due to this problem, every job reports a Memory Utilized value of
> > 0.00 MB.
> > 
> > 
> > With slurm-17.11.0 it works fine:
> > 
> > 
> > # /usr/local/slurm-17.11.0/bin/seff 1694112
> > Job ID: 1694112
> > Cluster: XXXXX
> > User/Group: XXXXX
> > State: COMPLETED (exit code 0)
> > Nodes: 1
> > Cores per node: 2
> > CPU Utilized: 01:39:33
> > CPU Efficiency: 4266.43% of 00:02:20 core-walltime
> > Job Wall-clock time: 00:01:10
> > Memory Utilized: 2.44 GB
> > Memory Efficiency: 62.57% of 3.91 GB
> > [root at hydra bin]#
> > 
> > 
> > 
> > 
> > Miguel A. Sánchez Gómez
> > System Administrator
> > Research Programme on Biomedical Informatics - GRIB (IMIM-UPF)
> > 
> > Barcelona Biomedical Research Park (office 4.80)
> > Doctor Aiguader 88 | 08003 Barcelona (Spain)
> > Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550
> > e-mail: miguelangel.sanchez at upf.edu
> > On 11/06/2018 06:30 PM, Mike Cammilleri wrote:
> > > 
> > > Thanks for this. We'll try the workaround script. It is not
> > > mission-critical, but our users have gotten accustomed to seeing
> > > these metrics at the end of each run, and it's nice to have. We are
> > > currently doing this in a test VM environment, so by the time we
> > > actually do the upgrade to the cluster perhaps the fix will be
> > > available then.
> > > 
> > > 
> > > Mike Cammilleri
> > > 
> > > Systems Administrator
> > > 
> > > Department of Statistics | UW-Madison
> > > 
> > > 1300 University Ave | Room 1280
> > > 608-263-6673 | mikec at stat.wisc.edu
> > > 
> > > 
> > > 
> > > ------------------------------------------------------------------------
> > > From: slurm-users <slurm-users-bounces at lists.schedmd.com> on
> > > behalf of Chris Samuel <chris at csamuel.org>
> > > Sent: Tuesday, November 6, 2018 5:03 AM
> > > To: slurm-users at lists.schedmd.com
> > > Subject: Re: [slurm-users] Seff error with Slurm-18.08.1
> > > On 6/11/18 7:49 pm, Baker D.J. wrote:
> > > 
> > > > The good news is that I am assured by SchedMD that the bug has
> > > > been fixed in v18.08.3.
> > > 
> > > Looks like it's fixed in this commit.
> > > 
> > > commit 3d85c8f9240542d9e6dfb727244e75e449430aac
> > > Author: Danny Auble <da at schedmd.com>
> > > Date:   Wed Oct 24 14:10:12 2018 -0600
> > > 
> > >      Handle symbol resolution errors in the 18.08 slurmdbd.
> > > 
> > >      Caused by b1ff43429f6426c when moving the slurmdbd agent internals.
> > > 
> > >      Bug 5882.
> > > 
> > > 
> > > > Having said that, we will probably live with this issue
> > > > rather than disrupt users with another upgrade so soon.
> > > 
> > > An upgrade to 18.08.3 from 18.08.1 shouldn't be disruptive though,
> > > should it? We just flip a symlink and the users see the new binaries,
> > > libraries, etc. immediately; we can then restart daemons as and when
> > > we need to (in the right order, of course: slurmdbd, slurmctld and
> > > then the slurmd's).
> > > 
> > > All the best,
> > > Chris
> > > -- 
> > >   Chris Samuel  : http://www.csamuel.org/ :  Melbourne, VIC
> > > 
> > 
> 
> -- 
> Marcus Wagner, Dipl.-Inf.
> 
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
> 

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
-------------- next part --------------
--- a/contribs/seff/seff
+++ b/contribs/seff/seff
@@ -126,10 +126,13 @@ for my $step (@{$job->{'steps'}}) {
     $tot_cpu_sec += $step->{'tot_cpu_sec'};
     $tot_cpu_usec += $step->{'tot_cpu_usec'};

-    my $lmem = $step->{'stats'}{'rss_max'};
-    if ($mem < $lmem) {
-        $mem = $lmem;
-        $ntasks = $step->{'ntasks'};
+    if (exists $step->{'stats'} && exists $step->{'stats'}{'tres_usage_in_max'}) {
+        my $lmem = Slurmdb::find_tres_count_in_string($step->{'stats'}{'tres_usage_in_max'}, TRES_MEM);
+
+        if ($mem < $lmem) {
+            $mem = $lmem;
+            $ntasks = $step->{'ntasks'};
+        }
     }
 }
 my $cput = $tot_cpu_sec + int(($tot_cpu_usec / 1000000) + 0.5);

