[slurm-users] Slurm accounting problem - NCPUs=0
Coulter, John Eric
jecoulte at iu.edu
Tue Jan 30 13:13:00 MST 2018
Hi All,
I've run into a strange problem with my Slurm configuration. I'm trying to set up AccountingStorage properly so that I can use OpenXDMoD to produce usage reports, but the output I'm getting from sacct shows 0 for a huge number of fields, such as NCPUS and CPUTimeRaw (which are rather important for usage reports).
Has anyone here run into something similar before? It would be great if someone could point out what I've misconfigured. I've pasted the relevant bits of my Slurm config and the sacct output after my sig.
Thanks!
------------------------------------
Eric Coulter jecoulte at iu.edu
XSEDE Capabilities and Resource Integration Engineer
IU Campus Bridging & Research Infrastructure
RT/PTI/UITS
812-856-3250
[jecoulte at headnode ~]$ scontrol show config | grep Acc
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = headnode
AccountingStorageLoc = /var/log/slurmacct.log
AccountingStoragePort = 0
AccountingStorageTRES = cpu,mem,energy,node #Added these in case the default wasn't being respected for some reason...
AccountingStorageType = accounting_storage/filetxt
AccountingStorageUser = root
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
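(Side note: in case accounting_storage/filetxt itself turns out to be the limitation here, I've been considering moving to slurmdbd, since that's what the OpenXDMoD docs seem to assume anyway. Purely a sketch on my end, untested, hostname as a placeholder:

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=headnode        # wherever slurmdbd runs
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

plus a matching slurmdbd.conf pointing at a MySQL/MariaDB database.)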
For a job running on 2 nodes with 1 CPU per node, sacct shows:
[jecoulte at headnode ~]$ sudo sacct -j 386 --format JobID,JobName,AllocNodes,TotalCPU,CPUTime,NCPUS,CPUTimeRaw,AllocCPUs
       JobID    JobName AllocNodes   TotalCPU    CPUTime      NCPUS CPUTimeRAW  AllocCPUS
------------ ---------- ---------- ---------- ---------- ---------- ---------- ----------
386          fact_job.+          2  00:49.345   00:00:00          0          0          0
386.0          hostname          2  00:00.006   00:00:00          0          0          0
386.1        fact-sum.g          2  00:49.338   00:00:00          0          0          0
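For reference, fact_job.job is essentially the following (a sketch; the second step's command is a placeholder based on the name recorded above, which may be truncated):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
srun hostname       # recorded as step 386.0
srun ./fact-sum.g   # recorded as step 386.1 (placeholder name)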
For the same job, the records in AccountingStorageLoc are:
[jecoulte at headnode ~]$ grep ^386 /var/log/slurmacct.log
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006538 1000 1000 - - 1 0 1 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0 0.00 hostname compute-[0-1] 0 0 0 0 (null) 4294967295
386 low 1517006536 1517006538 1000 1000 - - 1 0 3 0 2 2 0 0 6466 0 5388 0 1078 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269148 1 236380.00 620 1 618.00 0 1 0.00 0 1 0.00 hostname compute-[0-1] 1 1 1 1 (null) 4294967295
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006538 1000 1000 - - 1 1 1 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0 0.00 fact-sum.g compute-[0-1] 0 0 0 0 (null) 4294967295
386 low 1517006536 1517006565 1000 1000 - - 1 1 3 0 2 2 27 49 338902 48 94477 1 244425 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269148 1 236380.00 620 1 618.00 0 1 0.00 0 1 0.00 fact-sum.g compute-[0-1] 1 1 1 1 (null) 4294967295
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006565 1000 1000 - - 3 28 3 4294967295 0
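(In case it helps anyone read those: the job records and the step records clearly have different layouts, so I've been comparing field counts with something like

[jecoulte at headnode ~]$ grep ^386 /var/log/slurmacct.log | awk '{print NR": "NF" fields"}'

before trying to match columns to fields.)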