[slurm-users] Seff error with Slurm-18.08.1

Thu Nov 8 23:59:20 MST 2018

Thanks Paddy,

just something learned again ;)

Best
Marcus

On 11/08/2018 05:07 PM, Paddy Doyle wrote:
> Hi all,
>
> It looks like we can use the API to avoid having to manually parse the '2='
> value from the stats{tres_usage_in_max} value.
>
> I've submitted a bug report and patch:
>
> https://bugs.schedmd.com/show_bug.cgi?id=6004
>
> The minimal changes needed would be in the attached seff.patch.
>
> Hope that helps,
>
> Paddy
>
> On Thu, Nov 08, 2018 at 11:54:59AM +0100, Marcus Wagner wrote:
>
>> Hi Miguel,
>>
>>
>> this is because SchedMD changed the stats field. The rss_max field no longer
>> exists (cf. line 225 of seff).
>> Instead you need to evaluate the field stats{tres_usage_in_max} and take the
>> value after '2='. That value is the memory in bytes rather than kbytes, so it
>> must additionally be divided by 1024.
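
The extraction described above can be sketched in Python (seff itself is a Perl script, so this only illustrates the logic and is not the patch attached to the bug report). It assumes that tres_usage_in_max is a comma-separated list of id=value pairs and, as stated in the thread, that the value after '2=' is the memory in bytes. The function name max_rss_kb and the sample string are hypothetical.

```python
# Assumption taken from the thread: the entry with id 2 holds the memory value.
TRES_MEM_ID = "2"


def max_rss_kb(tres_usage_in_max):
    """Return the memory entry of a TRES usage string in KB, or None if absent.

    Assumes a comma-separated "id=value" layout, e.g. "1=100,2=2048".
    """
    for item in tres_usage_in_max.split(","):
        key, sep, value = item.partition("=")
        if sep and key.strip() == TRES_MEM_ID:
            # The value is in bytes; older seff code expected kbytes.
            return int(value) / 1024
    return None


# Hypothetical sample string, not taken from a real job record.
print(max_rss_kb("1=100,2=2048"))  # 2.0
```

Returning None when the '2=' entry is missing lets the caller fall back to 0 explicitly, rather than comparing an undefined value as in the "uninitialized value $lmem" warning quoted further down the thread.
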
>>
>>
>> Best
>> Marcus
>>
>> On 11/08/2018 11:06 AM, Miguel A. Sánchez wrote:
>>> Hi, thanks for all your answers, and sorry for the delay in replying.
>>> Yesterday I installed Slurm-18.08.3 on the controller machine to check
>>> whether the seff command works with this latest release. The behavior
>>> has improved, but I still receive an error message:
>>>
>>>
>>> # /usr/local/slurm-18.08.3/bin/seff 1694112
>>> *Use of uninitialized value $lmem in numeric lt (<) at
>>> /usr/local/slurm-18.08.3/bin/seff line 130, <DATA> line 624.*
>>> Job ID: 1694112
>>> Cluster: XXXXX
>>> User/Group: XXXXX
>>> State: COMPLETED (exit code 0)
>>> Nodes: 1
>>> Cores per node: 2
>>> CPU Utilized: 01:39:33
>>> CPU Efficiency: 4266.43% of 00:02:20 core-walltime
>>> Job Wall-clock time: 00:01:10
>>> Memory Utilized: 0.00 MB (estimated maximum)
>>> Memory Efficiency: 0.00% of 3.91 GB (3.91 GB/node)
>>> [root at hydra ~]#
>>>
>>>
>>> Because of this problem, every job shows 0.00 MB as its memory
>>> utilized.
>>>
>>>
>>> With slurm-17.11.1 it works fine:
>>>
>>>
>>> # /usr/local/slurm-17.11.0/bin/seff 1694112
>>> Job ID: 1694112
>>> Cluster: XXXXX
>>> User/Group: XXXXX
>>> State: COMPLETED (exit code 0)
>>> Nodes: 1
>>> Cores per node: 2
>>> CPU Utilized: 01:39:33
>>> CPU Efficiency: 4266.43% of 00:02:20 core-walltime
>>> Job Wall-clock time: 00:01:10
>>> Memory Utilized: 2.44 GB
>>> Memory Efficiency: 62.57% of 3.91 GB
>>> [root at hydra bin]#
>>>
>>>
>>>
>>>
>>> Miguel A. Sánchez Gómez
>>> System Administrator
>>> Research Programme on Biomedical Informatics - GRIB (IMIM-UPF)
>>>
>>> Barcelona Biomedical Research Park (office 4.80)
>>> Doctor Aiguader 88 | 08003 Barcelona (Spain)
>>> Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550
>>> e-mail:miguelangel.sanchez at upf.edu
>>> On 11/06/2018 06:30 PM, Mike Cammilleri wrote:
>>>> Thanks for this. We'll try the workaround script. It is not
>>>> mission-critical but our users have gotten accustomed to seeing
>>>> these metrics at the end of each run and it's nice to have. We are
>>>> currently doing this in a test VM environment, so by the time we
>>>> actually do the upgrade to the cluster perhaps the fix will be
>>>> available then.
>>>>
>>>>
>>>> Mike Cammilleri
>>>>
>>>> Systems Administrator
>>>>
>>>> Department of Statistics | UW-Madison
>>>>
>>>> 1300 University Ave | Room 1280
>>>> 608-263-6673 | mikec at stat.wisc.edu
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on
>>>> behalf of Chris Samuel <chris at csamuel.org>
>>>> *Sent:* Tuesday, November 6, 2018 5:03 AM
>>>> *To:* slurm-users at lists.schedmd.com
>>>> *Subject:* Re: [slurm-users] Seff error with Slurm-18.08.1
>>>> On 6/11/18 7:49 pm, Baker D.J. wrote:
>>>>
>>>>> The good news is that I am assured by SchedMD that the bug has been fixed
>>>>> in v18.08.3.
>>>> Looks like it's fixed in this commit.
>>>>
>>>> commit 3d85c8f9240542d9e6dfb727244e75e449430aac
>>>> Author: Danny Auble <da at schedmd.com>
>>>> Date:   Wed Oct 24 14:10:12 2018 -0600
>>>>
>>>>       Handle symbol resolution errors in the 18.08 slurmdbd.
>>>>
>>>>       Caused by b1ff43429f6426c when moving the slurmdbd agent internals.
>>>>
>>>>       Bug 5882.
>>>>
>>>>
>>>>> Having said that we will probably live with this issue
>>>>> rather than disrupt users with another upgrade so soon.
>>>> An upgrade to 18.08.3 from 18.08.1 shouldn't be disruptive though,
>>>> should it?  We just flip a symlink and the users see the new binaries,
>>>> libraries, etc immediately, we can then restart daemons as and when we
>>>> need to (in the right order of course, slurmdbd, slurmctld and then
>>>> slurmd's).
>>>>
>>>> All the best,
>>>> Chris
>>>> -- 
>>>>    Chris Samuel  : http://www.csamuel.org/ :  Melbourne, VIC
>>>>
>> -- 
>> Marcus Wagner, Dipl.-Inf.
>>
>> IT Center
>> Abteilung: Systeme und Betrieb
>> RWTH Aachen University
>> Seffenter Weg 23
>> 52074 Aachen
>> Tel: +49 241 80-24383
>> Fax: +49 241 80-624383
>> wagner at itc.rwth-aachen.de
>> www.itc.rwth-aachen.de
>>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de