[slurm-users] sshare is crashing

Holger Naundorf naundorf at rz.uni-kiel.de
Fri Aug 21 10:40:22 UTC 2020


On 11.08.2020 at 20:55, Richard Lefebvre wrote:
> Hi,
>
> The command "sshare -l" is crashing. I isolated the problem to a single 
> account. The cause seems to be an extremely large LevelFS, on the order 
> of 4.8x10^16. I can see the value if I add the "-p" option. Is there a 
> way to fix the account?
>
I have seen this as well. I did not bother to trace it in the code, but 
I would guess it's some underflow problem (as the raw usage of the 
account decays toward zero, the LevelFS gets ever bigger).
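
To make that guess concrete: the Fair Tree documentation describes the level fairshare as LF = S / U (shares divided by usage). The Python sketch below is not Slurm's code; the half-life decay, the starting usage and the helper names are assumptions for illustration. It only shows that the ratio keeps growing while the usage shrinks without ever reaching exactly zero, and it back-computes the usage that the LevelFS from the quoted output would imply if it were simply NormShares divided by usage.

```python
import math

# Values taken from the "sshare -l -p" output quoted in this thread.
NORM_SHARES = 0.003724
REPORTED_LEVEL_FS = 48285673640776424.0


def level_fs(shares: float, usage: float) -> float:
    """Shares divided by usage; infinite when usage is exactly zero."""
    if usage == 0.0:
        return math.inf
    return shares / usage


def decayed(usage: float, elapsed: float, half_life: float) -> float:
    """Plain half-life decay (an assumed model, for illustration only)."""
    return usage * 0.5 ** (elapsed / half_life)


def implied_usage(shares: float, lf: float) -> float:
    """Usage that would produce the given LevelFS under LF = S / U."""
    return shares / lf


if __name__ == "__main__":
    start = 0.01  # hypothetical starting usage
    for periods in (0, 10, 30, 60):
        u = decayed(start, periods, 1.0)
        print(f"{periods:3d} half-lives: usage={u:.3e} LevelFS={level_fs(NORM_SHARES, u):.3e}")
    print(f"implied usage: {implied_usage(NORM_SHARES, REPORTED_LEVEL_FS):.3e}")
```

Under these assumptions the implied usage is tiny but non-zero, which would explain why RawUsage and EffectvUsage print as 0 while LevelFS is huge rather than 'inf'.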

It can be fixed by resetting the account to 'true' zero usage:

(sacctmgr modify account NAME set rawusage=0)

When the next fair-share (FS) recalculation kicks in, the huge LevelFS 
resets to 'inf' and the problem goes away.
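
If more than one account might be affected, the parsable output can be scanned for them. The following Python sketch parses pipe-delimited "sshare -l -p" text (here the sample quoted further down in this thread) and lists account rows whose LevelFS is finite but above a threshold. The threshold of 1e12 and the function name are assumptions chosen for illustration. The script only prints the reset command for review and does not run it.

```python
import math

# Sample copied from the "sshare -l -p" output quoted in this thread.
SAMPLE = """Account|User|RawShares|NormShares|RawUsage|NormUsage|EffectvUsage|FairShare|LevelFS|GrpTRESMins|TRESRunMins|
group001_cpu||650216|0.003724|0|0.000000|0.000000||48285673640776424.000000||cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0|
"""


def suspicious_accounts(text, threshold=1e12):
    """Return (account, LevelFS) pairs for account rows with a huge finite LevelFS."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    hits = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        if row.get("User"):
            continue  # skip user rows; the reset is applied per account
        try:
            value = float(row.get("LevelFS", ""))
        except ValueError:
            continue  # empty or non-numeric LevelFS
        if math.isfinite(value) and value > threshold:
            hits.append((row["Account"].strip(), value))
    return hits


if __name__ == "__main__":
    for name, value in suspicious_accounts(SAMPLE):
        print(f"{name}: LevelFS={value:.4e}")
        print(f"  sacctmgr modify account {name} set rawusage=0")
```

Rows whose LevelFS is already 'inf' are skipped on purpose, since that is the state the account ends up in after the reset.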

Regards,

Holger N.



> Below is the output of the command with the "-p" option, followed by 
> the same command without it, which crashes:
>
> sshare -l -p --account=group001_cpu
> Account|User|RawShares|NormShares|RawUsage|NormUsage|EffectvUsage|FairShare|LevelFS|GrpTRESMins|TRESRunMins|
> group001_cpu||650216|0.003724|0|0.000000|0.000000||48285673640776424.000000||cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0|
>
> sshare -l --account=group001_cpu
>              Account       User  RawShares  NormShares  RawUsage   NormUsage  EffectvUsage  FairShare    LevelFS                  GrpTRESMins                    TRESRunMins
> -------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
> *** Error in `sshare': free(): invalid next size (fast): 0x0000000000eff280 ***
> ======= Backtrace: =========
> /lib64/libc.so.6(+0x81679)[0x7efd0e82a679]
> /opt/software/slurm/lib64/slurm/libslurmfull.so(slurm_xfree+0x1d)[0x7efd0fcb9009]
> /opt/software/slurm/lib64/slurm/libslurmfull.so(print_fields_double+0x2d6)[0x7efd0fc02a08]
> sshare(process+0x51c)[0x4024c9]
> sshare[0x40292c]
> sshare(main+0xa2d)[0x40337f]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7efd0e7cb505]
> sshare[0x401da9]
> ======= Memory map: ========
> 00400000-00405000 r-xp 00000000 00:2f 51577 /opt/software/slurm/bin/sshare
> 00604000-00605000 r--p 00004000 00:2f 51577 /opt/software/slurm/bin/sshare
> 00605000-00606000 rw-p 00005000 00:2f 51577 /opt/software/slurm/bin/sshare
> 00ee3000-00f23000 rw-p 00000000 00:00 0 [heap]
> 7efd08000000-7efd08021000 rw-p 00000000 00:00 0
> 7efd08021000-7efd0c000000 ---p 00000000 00:00 0
> 7efd0d564000-7efd0d579000 r-xp 00000000 00:24 61849 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
> 7efd0d579000-7efd0d778000 ---p 00015000 00:24 61849 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
> 7efd0d778000-7efd0d779000 r--p 00014000 00:24 61849 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
> 7efd0d779000-7efd0d77a000 rw-p 00015000 00:24 61849 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
> 7efd0d77a000-7efd0d783000 r-xp 00000000 00:24 66799 /usr/lib64/libmunge.so.2.0.0
> 7efd0d783000-7efd0d982000 ---p 00009000 00:24 66799 /usr/lib64/libmunge.so.2.0.0
> 7efd0d982000-7efd0d983000 r--p 00008000 00:24 66799 /usr/lib64/libmunge.so.2.0.0
> 7efd0d983000-7efd0d984000 rw-p 00009000 00:24 66799 /usr/lib64/libmunge.so.2.0.0
> 7efd0d984000-7efd0d987000 r-xp 00000000 00:2f 51448 /opt/software/slurm/lib64/slurm/auth_munge.so
> 7efd0d987000-7efd0db86000 ---p 00003000 00:2f 51448 /opt/software/slurm/lib64/slurm/auth_munge.so
> 7efd0db86000-7efd0db87000 r--p 00002000 00:2f 51448 /opt/software/slurm/lib64/slurm/auth_munge.so
> 7efd0db87000-7efd0db88000 rw-p 00003000 00:2f 51448 /opt/software/slurm/lib64/slurm/auth_munge.so
> 7efd0db88000-7efd0e38d000 r--s 00000000 00:24 191641 /var/lib/sss/mc/passwd
> 7efd0e38d000-7efd0e395000 r-xp 00000000 00:24 66184 /usr/lib64/libnss_sss.so.2
> 7efd0e395000-7efd0e594000 ---p 00008000 00:24 66184 /usr/lib64/libnss_sss.so.2
> 7efd0e594000-7efd0e595000 r--p 00007000 00:24 66184 /usr/lib64/libnss_sss.so.2
> 7efd0e595000-7efd0e596000 rw-p 00008000 00:24 66184 /usr/lib64/libnss_sss.so.2
> 7efd0e596000-7efd0e5a2000 r-xp 00000000 00:24 62229 /usr/lib64/libnss_files-2.17.so
> 7efd0e5a2000-7efd0e7a1000 ---p 0000c000 00:24 62229 /usr/lib64/libnss_files-2.17.so
> 7efd0e7a1000-7efd0e7a2000 r--p 0000b000 00:24 62229 /usr/lib64/libnss_files-2.17.so
> 7efd0e7a2000-7efd0e7a3000 rw-p 0000c000 00:24 62229 /usr/lib64/libnss_files-2.17.so
> 7efd0e7a3000-7efd0e7a9000 rw-p 00000000 00:00 0
> 7efd0e7a9000-7efd0e96c000 r-xp 00000000 00:24 62154 /usr/lib64/libc-2.17.so
> 7efd0e96c000-7efd0eb6c000 ---p 001c3000 00:24 62154 /usr/lib64/libc-2.17.so
> 7efd0eb6c000-7efd0eb70000 r--p 001c3000 00:24 62154 /usr/lib64/libc-2.17.so
> 7efd0eb70000-7efd0eb72000 rw-p 001c7000 00:24 62154 /usr/lib64/libc-2.17.so
> 7efd0eb72000-7efd0eb77000 rw-p 00000000 00:00 0
> 7efd0eb77000-7efd0eb8e000 r-xp 00000000 00:24 62349 /usr/lib64/libpthread-2.17.so
> 7efd0eb8e000-7efd0ed8d000 ---p 00017000 00:24 62349 /usr/lib64/libpthread-2.17.so
> 7efd0ed8d000-7efd0ed8e000 r--p 00016000 00:24 62349 /usr/lib64/libpthread-2.17.so
> 7efd0ed8e000-7efd0ed8f000 rw-p 00017000 00:24 62349 /usr/lib64/libpthread-2.17.so
> 7efd0ed8f000-7efd0ed93000 rw-p 00000000 00:00 0
> 7efd0ed93000-7efd0edb8000 r-xp 00000000 00:24 62205 /usr/lib64/libtinfo.so.5.9
> 7efd0edb8000-7efd0efb8000 ---p 00025000 00:24 62205 /usr/lib64/libtinfo.so.5.9
> 7efd0efb8000-7efd0efbc000 r--p 00025000 00:24 62205 /usr/lib64/libtinfo.so.5.9
> 7efd0efbc000-7efd0efbd000 rw-p 00029000 00:24 62205 /usr/lib64/libtinfo.so.5.9
> 7efd0efbd000-7efd0efe3000 r-xp 00000000 00:24 62147 /usr/lib64/libncurses.so.5.9
> 7efd0efe3000-7efd0f1e2000 ---p 00026000 00:24 62147 /usr/lib64/libncurses.so.5.9
> 7efd0f1e2000-7efd0f1e3000 r--p 00025000 00:24 62147 /usr/lib64/libncurses.so.5.9
> 7efd0f1e3000-7efd0f1e4000 rw-p 00026000 00:24 62147 /usr/lib64/libncurses.so.5.9
> 7efd0f1e4000-7efd0f1ec000 r-xp 00000000 00:24 62410 /usr/lib64/libhistory.so.6.2
> 7efd0f1ec000-7efd0f3eb000 ---p 00008000 00:24 62410 /usr/lib64/libhistory.so.6.2
> 7efd0f3eb000-7efd0f3ec000 r--p 00007000 00:24 62410 /usr/lib64/libhistory.so.6.2
> 7efd0f3ec000-7efd0f3ed000 rw-p 00008000 00:24 62410 /usr/lib64/libhistory.so.6.2
> 7efd0f3ed000-7efd0f429000 r-xp 00000000 00:24 62408 /usr/lib64/libreadline.so.6.2
> 7efd0f429000-7efd0f629000 ---p 0003c000 00:24 62408 /usr/lib64/libreadline.so.6.2
> 7efd0f629000-7efd0f62b000 r--p 0003c000 00:24 62408 /usr/lib64/libreadline.so.6.2
> 7efd0f62b000-7efd0f631000 rw-p 0003e000 00:24 62408 /usr/lib64/libreadline.so.6.2
> 7efd0f631000-7efd0f633000 rw-p 00000000 00:00 0
> 7efd0f633000-7efd0f734000 r-xp 00000000 00:24 62170 /usr/lib64/libm-2.17.so
> 7efd0f734000-7efd0f933000 ---p 00101000 00:24 62170 /usr/lib64/libm-2.17.so
> 7efd0f933000-7efd0f934000 r--p 00100000 00:24 62170 /usr/lib64/libm-2.17.so
> 7efd0f934000-7efd0f935000 rw-p 00101000 00:24 62170 /usr/lib64/libm-2.17.so
> 7efd0f935000-7efd0f937000 r-xp 00000000 00:24 62166 /usr/lib64/libdl-2.17.so
> group001_cpu        650216    0.003712           0    0.000000  0.000000            4.8104e+16 Aborted

-- 
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax:  +49 431 880-1523
naundorf at rz.uni-kiel.de
