[slurm-users] Seff error with Slurm-18.08.1

Miguel A. Sánchez miguelangel.sanchez at upf.edu
Fri Nov 9 08:00:00 MST 2018


Oh, thanks Paddy for your patch, it works very well!

Miguel A. Sánchez Gómez
System Administrator
Research Programme on Biomedical Informatics - GRIB (IMIM-UPF)

Barcelona Biomedical Research Park (office 4.80)
Doctor Aiguader 88 | 08003 Barcelona (Spain)
Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550
e-mail: miguelangel.sanchez at upf.edu

On 11/09/2018 07:59 AM, Marcus Wagner wrote:
> Thanks Paddy,
>
> just something learned again ;)
>
>
> Best
> Marcus
>
> On 11/08/2018 05:07 PM, Paddy Doyle wrote:
>> Hi all,
>>
>> It looks like we can use the API to avoid having to manually parse
>> the '2=' value out of the stats{tres_usage_in_max} string.
>>
>> I've submitted a bug report and patch:
>>
>> https://bugs.schedmd.com/show_bug.cgi?id=6004
>>
>> The minimal changes needed are in the attached seff.patch.
>>
>> Hope that helps,
>>
>> Paddy
>>
>> On Thu, Nov 08, 2018 at 11:54:59AM +0100, Marcus Wagner wrote:
>>
>>> Hi Miguel,
>>>
>>>
>>> this is because SchedMD changed the stats fields. There is no rss_max
>>> any more (cf. line 225 of seff). You need to evaluate the field
>>> stats{tres_usage_in_max} and take the value after '2=' there, but that
>>> is the memory value in bytes instead of kbytes, so it additionally has
>>> to be divided by 1024.
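>>>
>>> In seff that boils down to roughly the following (an untested sketch;
>>> %stats is the hash seff already fills from the job step accounting,
>>> and '2' is the TRES id for memory):
>>>
>>>     # Take the max RSS from tres_usage_in_max instead of the old rss_max.
>>>     # The string looks like "1=...,2=...,..."; TRES id 2 is memory in bytes.
>>>     my $lmem = 0;
>>>     my $tres = $stats{tres_usage_in_max} // '';
>>>     if ($tres =~ /(?:^|,)2=(\d+)/) {
>>>         $lmem = $1 / 1024;    # bytes -> kbytes, like the old rss_max value
>>>     }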
>>>
>>>
>>> Best
>>> Marcus
>>>
>>> On 11/08/2018 11:06 AM, Miguel A. Sánchez wrote:
>>>> Hi, thanks for all your answers and sorry for the delay in my reply.
>>>> Yesterday I installed Slurm-18.08.3 on the controller machine to check
>>>> whether the seff command works correctly with this latest release. The
>>>> behavior has improved, but I still receive an error message:
>>>>
>>>>
>>>> # /usr/local/slurm-18.08.3/bin/seff 1694112
>>>> Use of uninitialized value $lmem in numeric lt (<) at
>>>> /usr/local/slurm-18.08.3/bin/seff line 130, <DATA> line 624.
>>>> Job ID: 1694112
>>>> Cluster: XXXXX
>>>> User/Group: XXXXX
>>>> State: COMPLETED (exit code 0)
>>>> Nodes: 1
>>>> Cores per node: 2
>>>> CPU Utilized: 01:39:33
>>>> CPU Efficiency: 4266.43% of 00:02:20 core-walltime
>>>> Job Wall-clock time: 00:01:10
>>>> Memory Utilized: 0.00 MB (estimated maximum)
>>>> Memory Efficiency: 0.00% of 3.91 GB (3.91 GB/node)
>>>> [root at hydra ~]#
>>>>
>>>>
>>>> And due to this problem, every job shows a memory utilized value of
>>>> 0.00 MB.
>>>>
>>>>
>>>> With slurm-17.11.1 it works fine:
>>>>
>>>>
>>>> # /usr/local/slurm-17.11.0/bin/seff 1694112
>>>> Job ID: 1694112
>>>> Cluster: XXXXX
>>>> User/Group: XXXXX
>>>> State: COMPLETED (exit code 0)
>>>> Nodes: 1
>>>> Cores per node: 2
>>>> CPU Utilized: 01:39:33
>>>> CPU Efficiency: 4266.43% of 00:02:20 core-walltime
>>>> Job Wall-clock time: 00:01:10
>>>> Memory Utilized: 2.44 GB
>>>> Memory Efficiency: 62.57% of 3.91 GB
>>>> [root at hydra bin]#
>>>>
>>>>
>>>>
>>>>
>>>> Miguel A. Sánchez Gómez
>>>> System Administrator
>>>> Research Programme on Biomedical Informatics - GRIB (IMIM-UPF)
>>>>
>>>> Barcelona Biomedical Research Park (office 4.80)
>>>> Doctor Aiguader 88 | 08003 Barcelona (Spain)
>>>> Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550
>>>> e-mail: miguelangel.sanchez at upf.edu
>>>> On 11/06/2018 06:30 PM, Mike Cammilleri wrote:
>>>>> Thanks for this. We'll try the workaround script. It is not
>>>>> mission-critical, but our users have gotten accustomed to seeing
>>>>> these metrics at the end of each run and it's nice to have. We are
>>>>> currently doing this in a test VM environment, so perhaps the fix
>>>>> will be available by the time we actually upgrade the cluster.
>>>>>
>>>>>
>>>>> Mike Cammilleri
>>>>>
>>>>> Systems Administrator
>>>>>
>>>>> Department of Statistics | UW-Madison
>>>>>
>>>>> 1300 University Ave | Room 1280
>>>>> 608-263-6673 | mikec at stat.wisc.edu
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on
>>>>> behalf of Chris Samuel <chris at csamuel.org>
>>>>> *Sent:* Tuesday, November 6, 2018 5:03 AM
>>>>> *To:* slurm-users at lists.schedmd.com
>>>>> *Subject:* Re: [slurm-users] Seff error with Slurm-18.08.1
>>>>> On 6/11/18 7:49 pm, Baker D.J. wrote:
>>>>>
>>>>>> The good news is that I am assured by SchedMD that the bug has
>>>>>> been fixed in v18.08.3.
>>>>> Looks like it's fixed in this commit.
>>>>>
>>>>> commit 3d85c8f9240542d9e6dfb727244e75e449430aac
>>>>> Author: Danny Auble <da at schedmd.com>
>>>>> Date:   Wed Oct 24 14:10:12 2018 -0600
>>>>>
>>>>>       Handle symbol resolution errors in the 18.08 slurmdbd.
>>>>>
>>>>>       Caused by b1ff43429f6426c when moving the slurmdbd agent
>>>>> internals.
>>>>>
>>>>>       Bug 5882.
>>>>>
>>>>>
>>>>>> Having said that, we will probably live with this issue
>>>>>> rather than disrupt users with another upgrade so soon.
>>>>> An upgrade to 18.08.3 from 18.08.1 shouldn't be disruptive though,
>>>>> should it?  We just flip a symlink and the users see the new
>>>>> binaries, libraries, etc. immediately; we can then restart the
>>>>> daemons as and when we need to (in the right order of course:
>>>>> slurmdbd, slurmctld and then the slurmd's).
>>>>>
>>>>> All the best,
>>>>> Chris
>>>>> -- 
>>>>>    Chris Samuel  : http://www.csamuel.org/ :  Melbourne, VIC
>>>>>
>>> -- 
>>> Marcus Wagner, Dipl.-Inf.
>>>
>>> IT Center
>>> Abteilung: Systeme und Betrieb
>>> RWTH Aachen University
>>> Seffenter Weg 23
>>> 52074 Aachen
>>> Tel: +49 241 80-24383
>>> Fax: +49 241 80-624383
>>> wagner at itc.rwth-aachen.de
>>> www.itc.rwth-aachen.de
>>>
>



