[slurm-users] Account name length issue with Slurm 17.02.9
Simon Flood
S.M.Flood at uis.cam.ac.uk
Wed Dec 20 08:52:41 MST 2017
In the past couple of days we've noticed an odd issue when creating new
accounts, which we think is related to the length of the account name.
Having recently launched a new cluster we've switched to using account
names with the format <PIsurname>-<servicelevel>-<type> where
servicelevel is SL[1-4] and type is CPU, GPU, or KNL. At a minimum we'd
expect to have <PIsurname>-SL3-CPU and <PIsurname>-SL4-CPU, then
optionally <PIsurname>-SL[3-4]-GPU and/or <PIsurname>-SL[3-4]-KNL, and
possibly paying SL1 or SL2 accounts for any or all of CPU, GPU, and KNL.
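To make the naming scheme concrete, here is a minimal sketch that prints the account names implied for one PI. The function name account_names is hypothetical (it is not part of Slurm), it only builds strings and never calls sacctmgr, and it covers only the SL3/SL4 accounts described above, not the paying SL1/SL2 ones.

```shell
#!/usr/bin/env bash
# Print the account names the scheme implies for one PI surname.
# Usage: account_names <PIsurname> [type ...]   e.g. GPU and/or KNL
# Hypothetical helper: builds strings only, never touches Slurm.
account_names() {
    local surname=$1
    shift
    # Baseline accounts every PI is expected to have.
    printf '%s-SL3-CPU\n' "$surname"
    printf '%s-SL4-CPU\n' "$surname"
    # Optional resource types at the same two service levels.
    local type level
    for type in "$@"; do
        for level in SL3 SL4; do
            printf '%s-%s-%s\n' "$surname" "$level" "$type"
        done
    done
}

account_names TESTABCDE-FGHIJK GPU
```

With a 16-character surname such as the one used below, every generated name is 24 characters long.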
What we've found is that if we create a PI_TESTABCDE-FGHIJK account (I've
replaced the actual PI's surname with TESTABCDE-FGHIJK, but it was that
long: a double-barrelled surname), followed by TESTABCDE-FGHIJK-SL3-CPU
and TESTABCDE-FGHIJK-SL4-CPU accounts, each with PI_TESTABCDE-FGHIJK as
their parent, sacctmgr complains when we then try to create a
TESTABCDE-FGHIJK-SL3-GPU account. See below for the commands and output:
[root@slurm-master ~]# sacctmgr -vi add account Name=pi_testabcde-fghijk Description="Simon Flood" Cluster=csd3 parent=uis fairshare=parent
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
Adding Account(s)
pi_testabcde-fghijk
Settings
Description = simon flood
Organization = Parent/Account Name
Associations
A = pi_testabc C = csd3
Settings
Fairshare = parent
Parent = uis
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL3-CPU GrpTRESMins=cpu=12000000 DefaultQOS=cpu2 QOS=cpu2,intr Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
Adding Account(s)
testabcde-fghijk-sl3-cpu
Settings
Description = Account Name
Organization = Parent/Account Name
Associations
A = testabcde- C = csd3
Settings
Fairshare = 0
GrpTRESMins = cpu=12000000
Parent = pi_testabcde-fghijk
QOS = cpu2,intr
DefQOS = cpu2
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL4-CPU QOS=cpu3 Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
Adding Account(s)
testabcde-fghijk-sl4-cpu
Settings
Description = Account Name
Organization = Parent/Account Name
Associations
A = testabcde- C = csd3
Settings
Fairshare = 0
Parent = pi_testabcde-fghijk
QOS = cpu3
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL3-GPU GrpTRESMins=gres/gpu=480000 DefaultQOS=gpu2 QOS=gpu2,intr Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
Adding Account(s)
testabcde-fghijk-sl3-gpu
Settings
Description = Account Name
Organization = Parent/Account Name
Associations
A = testabcde- C = csd3
Settings
Fairshare = 0
GrpTRESMins = gres/gpu=480000
Parent = pi_testabcde-fghijk
QOS = gpu2,intr
DefQOS = gpu2
Problem adding accounts: Unspecified error
[root@slurm-master ~]# sacctmgr -n show account format=Account'%-25',Description'%-30',Organization'%-20' | grep -i testabcde-fghijk
pi_testabcde-fghijk       simon flood                    uis
testabcde-fghijk-sl3-cpu  testabcde-fghijk-sl3-cpu       pi_testabcde-fghijk
testabcde-fghijk-sl4-cpu  testabcde-fghijk-sl4-cpu       pi_testabcde-fghijk
When we originally saw this on Monday, trying to create the
TESTABCDE-FGHIJK-SL3-GPU account gave output suggesting it was trying
to create an association rather than an account, but that didn't happen
when repeating the steps with a fake "PI surname" for this message.
The other odd thing, which we suspect is related, is that when trying to
undo these account additions (as we created them with shorter names)
the delete removes the association but not the actual accounts:
[root@slurm-master ~]# sacctmgr delete account name=testabcde-fghijk-sl3-cpu cluster=csd3
Deleting account associations...
C = csd3 A = testabcde-fghijk-sl3-cpu of pi_testabcde-fghijk
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@slurm-master ~]# sacctmgr delete account name=testabcde-fghijk-sl4-cpu cluster=csd3
Deleting account associations...
C = csd3 A = testabcde-fghijk-sl4-cpu of pi_testabcde-fghijk
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@slurm-master ~]# sacctmgr -n show account format=Account'%-25',Description'%-30',Organization'%-20' | grep -i testabcde-fghijk
pi_testabcde-fghijk       simon flood                    uis
testabcde-fghijk-sl3-cpu  testabcde-fghijk-sl3-cpu       pi_testabcde-fghijk
testabcde-fghijk-sl4-cpu  testabcde-fghijk-sl4-cpu       pi_testabcde-fghijk
If we then check the MySQL tables, they show that the accounts still
exist but the associations do not. We're then tidying up by deleting
the accounts manually in MySQL.
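For anyone wanting to repeat the database check, the following is a rough sketch of the sort of queries involved. It makes several assumptions that are not stated in this message: that the accounting tables are called acct_table and <cluster>_assoc_table, that both carry a "deleted" flag column, and that the database is named slurm_acct_db. The function inspect_queries is a hypothetical helper that only prints SQL; it runs nothing and changes nothing.

```shell
#!/usr/bin/env bash
# Print (but do not run) read-only queries for inspecting stored accounts.
# ASSUMPTIONS, not from the original message: table names acct_table and
# <cluster>_assoc_table, and the columns selected below.
inspect_queries() {
    local cluster=$1 pattern=$2
    # Account records are assumed to be shared across clusters.
    printf "SELECT name, deleted FROM acct_table WHERE name LIKE '%s';\n" \
        "$pattern"
    # Association records are assumed to live in a per-cluster table.
    printf "SELECT acct, parent_acct, deleted FROM %s_assoc_table WHERE acct LIKE '%s';\n" \
        "$cluster" "$pattern"
}

inspect_queries csd3 'testabcde-fghijk%'
# To run them (database name is an assumption):
#   inspect_queries csd3 'testabcde-fghijk%' | mysql slurm_acct_db
```

Check the table and column names against your own database before relying on this, as the schema can differ between Slurm versions.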
Our guess is that, when creating the account, sacctmgr is comparing
against partial existing account names and hence thinks there's a clash.
I've had a quick look at the various bits of source code for sacctmgr
but with my limited C knowledge haven't spotted anything obvious.
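To illustrate the guess, this small sketch shortens each name to its first 10 characters, which is the width at which the "A =" lines in the output above are cut off. The function name_prefix is hypothetical and is not taken from the sacctmgr source; the truncation in that output may well be purely cosmetic, so this shows only what a partial comparison would see, not what sacctmgr actually does.

```shell
#!/usr/bin/env bash
# Return the first N characters of a name (default 10, the width of the
# "A =" column in the sacctmgr output quoted above).
# Illustration only: NOT taken from the sacctmgr source code.
name_prefix() {
    local name=$1 width=${2:-10}
    printf '%s\n' "${name:0:width}"
}

for name in testabcde-fghijk-sl3-cpu \
            testabcde-fghijk-sl4-cpu \
            testabcde-fghijk-sl3-gpu; do
    printf '%-26s -> %s\n' "$name" "$(name_prefix "$name")"
done
```

All three names shorten to "testabcde-", so any comparison made on a shortened name could not tell them apart. That would not by itself explain why the SL4-CPU account was accepted and the SL3-GPU one was not.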
Previously we were using a mix of <PIsurname>-<servicelevel> for CPU and
<PIsurname>-<servicelevel>-GPU for GPU (we didn't have KNL) so it's
possible this issue existed in an earlier version of Slurm (we are using
Slurm 14.11.8 on our old cluster) but we weren't hitting it.
Our new Slurm master is running Slurm 17.02.9 on Red Hat Enterprise
Linux 7.3.
If anyone wants further information please ask, though obviously we're
coming up to the Christmas holidays so responses might be delayed.
Regards,
Simon
--
Simon Flood
HPC System Administrator
University of Cambridge Information Services
United Kingdom