[slurm-users] Account name length issue with Slurm 17.02.9

Simon Flood S.M.Flood at uis.cam.ac.uk
Wed Dec 20 08:52:41 MST 2017


In the past couple of days we've noticed an odd issue when creating new 
accounts, which we think is related to the length of the account name.

Having recently launched a new cluster we've switched to using account 
names with the format <PIsurname>-<servicelevel>-<type> where 
servicelevel is SL[1-4] and type is CPU, GPU, or KNL. At a minimum we'd 
expect to have <PIsurname>-SL3-CPU and <PIsurname>-SL4-CPU, then 
optionally <PIsurname>-SL[3-4]-GPU and/or <PIsurname>-SL[3-4]-KNL, and 
possibly paying SL1 or SL2 accounts for any or all of CPU, GPU, and KNL.
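
To make the naming scheme concrete, here is a minimal Python sketch that 
expands a PI surname into candidate account names and prints their 
lengths. The function name account_names and its parameters are 
hypothetical, chosen only for illustration; nothing here comes from Slurm 
itself.

```python
from itertools import product

SERVICE_LEVELS = ("SL1", "SL2", "SL3", "SL4")
TYPES = ("CPU", "GPU", "KNL")


def account_names(pi_surname, levels=SERVICE_LEVELS, types=TYPES):
    """Build <PIsurname>-<servicelevel>-<type> names.

    Names are lowercased because sacctmgr reports them in lowercase
    in the transcript in this message.
    """
    return [
        f"{pi_surname}-{level}-{kind}".lower()
        for level, kind in product(levels, types)
    ]


if __name__ == "__main__":
    # The two accounts we expect every PI to have at a minimum.
    for name in account_names("TESTABCDE-FGHIJK", levels=("SL3", "SL4"), types=("CPU",)):
        print(f"{name}  ({len(name)} characters)")
```

With the placeholder surname used in this message, each full account name 
comes out at 24 characters.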

What we've found is that if we create a PI_TESTABCDE-FGHIJK account 
(I've replaced the actual PI's surname with TESTABCDE-FGHIJK, but it was 
that long: a double-barrelled surname), then TESTABCDE-FGHIJK-SL3-CPU 
and TESTABCDE-FGHIJK-SL4-CPU accounts, each with PI_TESTABCDE-FGHIJK as 
their parent, sacctmgr complains when we try to create a 
TESTABCDE-FGHIJK-SL3-GPU account. See below for the commands and output:

[root@slurm-master ~]# sacctmgr -vi add account Name=pi_testabcde-fghijk Description="Simon Flood" Cluster=csd3 parent=uis fairshare=parent
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
  Adding Account(s)
   pi_testabcde-fghijk
  Settings
   Description     = simon flood
   Organization    = Parent/Account Name
  Associations
   A = pi_testabc C = csd3
  Settings
   Fairshare     = parent
   Parent        = uis
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL3-CPU GrpTRESMins=cpu=12000000 DefaultQOS=cpu2 QOS=cpu2,intr Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
  Adding Account(s)
   testabcde-fghijk-sl3-cpu
  Settings
   Description     = Account Name
   Organization    = Parent/Account Name
  Associations
   A = testabcde- C = csd3
  Settings
   Fairshare     = 0
   GrpTRESMins   = cpu=12000000
   Parent        = pi_testabcde-fghijk
   QOS           = cpu2,intr
   DefQOS        = cpu2
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL4-CPU QOS=cpu3 Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
  Adding Account(s)
   testabcde-fghijk-sl4-cpu
  Settings
   Description     = Account Name
   Organization    = Parent/Account Name
  Associations
   A = testabcde- C = csd3
  Settings
   Fairshare     = 0
   Parent        = pi_testabcde-fghijk
   QOS           = cpu3
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL3-GPU GrpTRESMins=gres/gpu=480000 DefaultQOS=gpu2 QOS=gpu2,intr Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
  Adding Account(s)
   testabcde-fghijk-sl3-gpu
  Settings
   Description     = Account Name
   Organization    = Parent/Account Name
  Associations
   A = testabcde- C = csd3
  Settings
   Fairshare     = 0
   GrpTRESMins   = gres/gpu=480000
   Parent        = pi_testabcde-fghijk
   QOS           = gpu2,intr
   DefQOS        = gpu2
  Problem adding accounts: Unspecified error
[root@slurm-master ~]# sacctmgr -n show account format=Account'%-25',Description'%-30',Organization'%-20' | grep -i testabcde-fghijk
pi_testabcde-fghijk       simon flood                    uis
testabcde-fghijk-sl3-cpu  testabcde-fghijk-sl3-cpu       pi_testabcde-fghijk
testabcde-fghijk-sl4-cpu  testabcde-fghijk-sl4-cpu       pi_testabcde-fghijk

When we originally saw this on Monday, trying to create the 
TESTABCDE-FGHIJK-SL3-GPU account gave output suggesting that sacctmgr 
was trying to create an association rather than an account, but that 
didn't happen when we repeated the steps with the fake "PI surname" used 
in this message.

The other odd thing, which we suspect is related, is that when we try to 
undo these account additions (as we created them with shorter names), 
the delete removes the associations but not the actual accounts:

[root@slurm-master ~]# sacctmgr delete account name=testabcde-fghijk-sl3-cpu cluster=csd3
  Deleting account associations...
   C = csd3       A = testabcde-fghijk-sl3-cpu of pi_testabcde-fghijk
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@slurm-master ~]# sacctmgr delete account name=testabcde-fghijk-sl4-cpu cluster=csd3
  Deleting account associations...
   C = csd3       A = testabcde-fghijk-sl4-cpu of pi_testabcde-fghijk
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@slurm-master ~]# sacctmgr -n show account format=Account'%-25',Description'%-30',Organization'%-20' | grep -i testabcde-fghijk
pi_testabcde-fghijk       simon flood                    uis
testabcde-fghijk-sl3-cpu  testabcde-fghijk-sl3-cpu       pi_testabcde-fghijk
testabcde-fghijk-sl4-cpu  testabcde-fghijk-sl4-cpu       pi_testabcde-fghijk

If we then check MySQL, it shows that the accounts still exist but the 
associations do not. We're tidying up by deleting the accounts manually 
in MySQL.
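
The same mismatch could probably be spotted without going into MySQL by 
comparing two sacctmgr listings. The Python sketch below is only an 
illustration: the helper name orphaned_accounts is hypothetical, and the 
two sacctmgr invocations named in the docstring are assumptions about 
how one might produce the input, not something we have run as part of 
this report.

```python
def orphaned_accounts(account_output, assoc_output):
    """Return accounts that are listed but have no association.

    Both arguments are text with one account name per line, for example
    (assumed invocations, not verified here):
        sacctmgr -n -P show account format=Account
        sacctmgr -n -P show assoc format=Account
    """
    accounts = {line.strip() for line in account_output.splitlines() if line.strip()}
    associated = {line.strip() for line in assoc_output.splitlines() if line.strip()}
    # Anything in the account list with no matching association is a leftover.
    return sorted(accounts - associated)


if __name__ == "__main__":
    # Sample data mirroring the state after the two deletes in this message.
    account_listing = """pi_testabcde-fghijk
testabcde-fghijk-sl3-cpu
testabcde-fghijk-sl4-cpu
"""
    assoc_listing = """pi_testabcde-fghijk
"""
    for name in orphaned_accounts(account_listing, assoc_listing):
        print("account without association:", name)
```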

Our guess is that when creating an account, sacctmgr compares the new 
name against only part of each existing account name and so thinks 
there's a clash. I've had a quick look at the various bits of source 
code for sacctmgr but, with my limited C knowledge, haven't spotted 
anything obvious.
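
To illustrate what such a partial-name clash would look like, here is a 
small Python sketch that groups account names by a fixed-width prefix. 
It is not how sacctmgr works as far as we know; the helper name 
prefix_groups is hypothetical, and the width of 10 is taken only from 
the truncated "A = testabcde-" shown in the verbose output in this 
message.

```python
from collections import defaultdict


def prefix_groups(names, width):
    """Group names by their first `width` characters.

    Only groups with more than one member are returned, i.e. the names
    that would be indistinguishable if compared at that width.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    return {prefix: members for prefix, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    names = [
        "pi_testabcde-fghijk",
        "testabcde-fghijk-sl3-cpu",
        "testabcde-fghijk-sl4-cpu",
        "testabcde-fghijk-sl3-gpu",
    ]
    # Whether sacctmgr compares names at this width is an untested assumption.
    for prefix, members in prefix_groups(names, 10).items():
        print(prefix, "->", ", ".join(members))
```

Note that a plain 10-character comparison would also have made the 
SL4-CPU account clash with the SL3-CPU one, which did not happen, so 
this only helps enumerate candidate clashes and does not confirm the 
cause.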

Previously we were using a mix of <PIsurname>-<servicelevel> for CPU and 
<PIsurname>-<servicelevel>-GPU for GPU (we didn't have KNL) so it's 
possible this issue existed in an earlier version of Slurm (we are using 
Slurm 14.11.8 on our old cluster) but we weren't hitting it.

Our new Slurm master is running Slurm 17.02.9 on Red Hat Enterprise 
Linux 7.3.

If anyone wants further information, please ask, though obviously we're 
coming up to the Christmas holidays so responses might be delayed.

Regards,
Simon
-- 
Simon Flood
HPC System Administrator
University of Cambridge Information Services
United Kingdom


