<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div dir="ltr">
<div>What happens if you change <br>
</div>
<div><br>
</div>
<div>AccountingStorageHost=localhost</div>
<div><br>
</div>
<div>to</div>
<div><br>
</div>
<div>AccountingStorageHost=192.168.1.1</div>
<div></div>
<div>
<div>
<div class="gmail_signature" data-smartmail="gmail_signature">i.e. the same IP address as your controller, and restart the ctld<br>
</div>
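A minimal sketch of that change, shown against a stand-in copy of the file so nothing real is clobbered (the config path and the 192.168.1.1 address are taken from the slurm.conf quoted below):

```shell
# Stand-in copy of slurm.conf (adjust the path to your install, e.g. /etc/slurm-llnl/slurm.conf).
conf=/tmp/slurm.conf.copy
printf 'AccountingStorageHost=localhost\n' > "$conf"

# Point AccountingStorageHost at the controller's address instead of localhost:
sed -i 's/^AccountingStorageHost=.*/AccountingStorageHost=192.168.1.1/' "$conf"
grep '^AccountingStorageHost' "$conf"   # AccountingStorageHost=192.168.1.1

# Then, after editing the real config on every node:
#   sudo systemctl restart slurmctld
```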
<div class="gmail_signature" data-smartmail="gmail_signature"><br>
</div>
<div class="gmail_signature" data-smartmail="gmail_signature">Sean</div>
<div class="gmail_signature" data-smartmail="gmail_signature"><br>
</div>
<div class="gmail_signature" data-smartmail="gmail_signature">--<br>
</div>
<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead<br>
Research Computing Services | Business Services<br>
The University of Melbourne, Victoria 3010 Australia<br>
<br>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, 19 Mar 2020 at 22:05, Pascal Klink <<a href="mailto:pascal.klink@googlemail.com">pascal.klink@googlemail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Hi everyone,<br>
<br>
We currently have a problem with our SLURM setup on a small cluster of 7 machines: the accounted core usage is not correctly used in the share computation. I have set up a minimal (not) working example.<br>
<br>
In this example, we have one cluster to which we have added an account 'iasteam' as well as some users with the sacctmgr tool. Right after executing the corresponding commands and running 'sshare -al' we get the following output:<br>
<br>
<br>
<pre>
Account          User      RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS
---------------- --------- --------- ---------- -------- --------- ------------ --------- -------
root                                 0.000000   0                  1.000000
 root            root      1         0.500000   0        0.000000  0.000000     1.000000  inf
 iasteam                   1         0.500000   0        0.000000  0.000000               inf
  iasteam        carvalho  1         0.250000   0                  0.000000     0.000000  0.000000
  iasteam        hany      1         0.250000   0                  0.000000     0.000000  0.000000
  iasteam        pascal    1         0.250000   0                  0.000000     0.000000  0.000000
  iasteam        stark     1         0.250000   0                  0.000000     0.000000  0.000000
</pre>
<br>
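For reference, associations like the ones above are typically created along these lines (a sketch only; account and user names are taken from the output above, and the flags may need adjusting for your setup — these commands need a running slurmdbd):

```shell
# Create the account and attach the users to it (run with admin privileges):
sacctmgr add account iasteam Description="IAS team" Cluster=iascluster
sacctmgr add user carvalho Account=iasteam
sacctmgr add user hany     Account=iasteam
sacctmgr add user pascal   Account=iasteam
sacctmgr add user stark    Account=iasteam

# Verify the association tree that sshare reports:
sacctmgr show associations
```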
One thing that already seems strange here is that the 'FairShare' value is set to zero and no 'NormUsage' appears. But anyway, after executing the following commands:<br>
<br>
        sudo systemctl stop slurmctld<br>
        sudo systemctl restart slurmdbd<br>
        sudo systemctl start slurmctld<br>
<br>
I get output that looks better to me:<br>
<br>
<pre>
Account          User      RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS
---------------- --------- --------- ---------- -------- --------- ------------ --------- -------
root                                 0.000000   0                  1.000000
 root            root      1         0.500000   0        0.000000  0.000000     1.000000  inf
 iasteam                   1         0.500000   0        0.000000  0.000000               inf
  iasteam        carvalho  1         0.250000   0        0.000000  0.000000     1.000000  inf
  iasteam        hany      1         0.250000   0        0.000000  0.000000     1.000000  inf
  iasteam        pascal    1         0.250000   0        0.000000  0.000000     1.000000  inf
  iasteam        stark     1         0.250000   0        0.000000  0.000000     1.000000  inf
</pre>
<br>
<br>
<br>
The next thing I did was to run a job as the user pascal, cancelling it after ~3:33 minutes on a node with 32 cores. When I then execute
<br>
<br>
        sudo sacct -u pascal -o User,UserCPU,CPUTimeRAW,JobID<br>
<br>
I get the following output:<br>
<br>
<pre>
User     UserCPU    CPUTimeRAW  JobID
-------- ---------- ----------- ------------
pascal   02:53.154  6816        776_2
         02:53.154  6848        776_2.batch
</pre>
<br>
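As a quick sanity check on the CPUTimeRAW figure (values taken from the sacct output above):

```shell
# CPUTimeRAW is core-seconds: divide by the node's 32 cores to get wall-clock seconds.
secs=$((6848 / 32))
echo "$secs"                                        # 214
printf '%d:%02d\n' $((secs / 60)) $((secs % 60))    # 3:34
```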
Dividing 6848 by 32 yields 214 seconds, which is 3:34 minutes, so this calculation checks out. The problem now is that this data is not reflected in the call to 'sshare -al', which still yields<br>
<br>
<pre>
Account          User      RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS
---------------- --------- --------- ---------- -------- --------- ------------ --------- -------
root                                 0.000000   0                  1.000000
 root            root      1         0.500000   0        0.000000  0.000000     1.000000  inf
 iasteam                   1         0.500000   0        0.000000  0.000000               inf
  iasteam        carvalho  1         0.250000   0        0.000000  0.000000     1.000000  inf
  iasteam        hany      1         0.250000   0        0.000000  0.000000     1.000000  inf
  iasteam        pascal    1         0.250000   0        0.000000  0.000000     1.000000  inf
  iasteam        stark     1         0.250000   0        0.000000  0.000000     1.000000  inf
</pre>
<br>
<br>
Even after waiting overnight (in case the update of the data for sshare is asynchronous), 'sshare -al' still shows the incorrect usage. I think this is due to some communication failure between slurmdbd and slurmctld, as sacct uses the data from slurmdbd while sshare seems to use data from slurmctld (at least it is not possible to run sshare if slurmctld is not running).<br>
<br>
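A few standard Slurm commands that can help narrow down where the ctld/dbd link breaks (a sketch; these only produce useful output on the controller of a live cluster):

```shell
# What the running slurmctld believes about its accounting storage:
scontrol show config | grep -i '^AccountingStorage'

# Whether the cluster has registered with slurmdbd
# (ControlHost/ControlPort should show the ctld's address, not 0.0.0.0):
sacctmgr show cluster

# Jobs recorded in the dbd that the ctld never closed out:
sacctmgr show runawayjobs
```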
Is this a common misconfiguration of a SLURM setup, or is there something else strange going on? We found that a similar question was asked on the developer mailing list 6 years ago:<br>
<br>
<a href="https://slurm-dev.schedmd.narkive.com/nvLr2Rzl/sshare-and-sacct" rel="noreferrer" target="_blank">https://slurm-dev.schedmd.narkive.com/nvLr2Rzl/sshare-and-sacct</a><br>
<br>
However, no real answer was given as to why it happened. So we thought that this time someone might have an idea.<br>
<br>
Best<br>
Pascal<br>
<br>
<br>
P.S.: Here is the slurm config that we are using, as well as the slurmdbd config:<br>
<br>
slurm.conf:<br>
ControlMachine=mn01<br>
ControlAddr=192.168.1.1<br>
<br>
MpiDefault=none<br>
ProctrackType=proctrack/cgroup<br>
ReturnToService=1<br>
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>
SlurmdSpoolDir=/var/spool/slurmd<br>
SlurmUser=slurm<br>
StateSaveLocation=/var/spool/slurmctld<br>
SwitchType=switch/none<br>
TaskPlugin=task/none<br>
<br>
# SCHEDULING<br>
FastSchedule=1<br>
SchedulerType=sched/backfill<br>
SelectType=select/linear<br>
<br>
# ACCOUNTING<br>
AccountingStorageType=accounting_storage/slurmdbd<br>
AccountingStorageHost=localhost<br>
AccountingStoragePort=6819<br>
JobAcctGatherType=jobacct_gather/linux<br>
JobAcctGatherFrequency=10<br>
AccountingStorageEnforce=associations<br>
AccountingStorageUser=slurm<br>
ClusterName=iascluster<br>
<br>
# PRIORITY<br>
PriorityType=priority/multifactor<br>
PriorityDecayHalfLife=0<br>
PriorityUsageResetPeriod=MONTHLY<br>
PriorityFavorSmall=NO<br>
PriorityMaxAge=1-0<br>
<br>
PriorityWeightAge=500000<br>
PriorityWeightFairshare=1000000<br>
PriorityWeightJobSize=0<br>
PriorityWeightPartition=0<br>
PriorityWeightQOS=0<br>
<br>
# LOGGING<br>
SlurmctldDebug=debug<br>
SlurmctldLogFile=var/log/slurm/slurmctld.log<br>
<br>
SlurmdDebug=debug<br>
SlurmdLogFile=var/log/slurm/slurmd.log<br>
<br>
# COMPUTE NODES<br>
NodeName=cn0[1-7] NodeAddr=192.168.1.1[1-7] RealMemory=64397 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:rtx2080:1<br>
PartitionName=amd Nodes=cn0[1-7] Default=YES MaxTime=INFINITE State=UP<br>
<br>
<br>
slurmdbd.conf:<br>
AuthType=auth/munge<br>
AuthInfo=/var/run/munge/munge.socket.2<br>
DbdHost=localhost<br>
DbdPort=6819<br>
StorageHost=localhost<br>
StorageLoc=slurm_acct_db<br>
StoragePass=[OURPASSWORD]<br>
StorageType=accounting_storage/mysql<br>
StorageUser=slurm<br>
DebugLevel=debug<br>
LogFile=/var/log/slurm/slurmdbd.log<br>
PidFile=/var/run/slurm-llnl/slurmdbd.pid<br>
SlurmUser=slurm<br>
<br>
<br>
</blockquote>
</div>
</body>
</html>