[slurm-users] Slurm accounting problem - NCPUs=0

Coulter, John Eric jecoulte at iu.edu
Mon Feb 19 14:11:09 MST 2018


It appears that the filetxt accounting plugin just doesn't record those fields.
(The man page has a rather cryptic note: "The filetxt plugin records only a limited subset of accounting information and will prevent some sacct options from proper operation."
I did not expect something like NCPUs to fall outside of a useful but limited subset of information...)

I assume this is similar to the situation with jobcomp/filetxt described here: https://bugs.schedmd.com/show_bug.cgi?id=3229

If anyone else runs into this issue, the fix appears to be to just use slurmdbd. Thought I'd send this out, since all the previous posts I came across with the same issue never had a resolution.
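For anyone making the same switch, here is a minimal sketch of the config changes involved in moving from filetxt to slurmdbd. Hostnames, credentials, and the database name below are placeholders for illustration, not values from my setup - adjust for your site:

```ini
# slurm.conf - replace the filetxt backend with slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=headnode     # host where slurmdbd runs (placeholder)
# AccountingStorageLoc is not used with the slurmdbd backend

# slurmdbd.conf (on the slurmdbd host) - placeholder values
DbdHost=headnode
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=changeme               # placeholder - use a real password
StorageLoc=slurm_acct_db           # database name slurmdbd will use
```

After restarting slurmdbd and slurmctld, sacct should report NCPUS, CPUTimeRaw, etc. for newly completed jobs.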

Cheers,


------------------------------------
Eric Coulter         jecoulte at iu.edu
XSEDE Capabilities and Resource Integration Engineer
IU Campus Bridging & Research Infrastructure
RT/PTI/UITS
https://www.xsede.org/ecosystem/xcri-mission
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Coulter, John Eric <jecoulte at iu.edu>
Sent: Tuesday, January 30, 2018 3:13 PM
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] Slurm accounting problem - NCPUs=0


Hi All,

I've run into a strange problem with my slurm configuration. I'm trying to set up AccountingStorage properly so that I can use OpenXDMoD for producing usage reports, but the output I'm getting from sacct has only 0's for a huge number of fields like NCPUs and CPUTimeRaw (which are rather important for usage reports).

Has anyone here run into something similar before? It would be great if someone could point out what I've misconfigured. I've pasted the relevant bits of my slurm config and sacct output after my sig.

Thanks!


------------------------------------
Eric Coulter         jecoulte at iu.edu
XSEDE Capabilities and Resource Integration Engineer
IU Campus Bridging & Research Infrastructure
RT/PTI/UITS
812-856-3250

[jecoulte at headnode ~]$ scontrol show config | grep Acc
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = headnode
AccountingStorageLoc    = /var/log/slurmacct.log
AccountingStoragePort   = 0
AccountingStorageTRES   = cpu,mem,energy,node      #Added these in case the default wasn't being respected for some reason...
AccountingStorageType   = accounting_storage/filetxt
AccountingStorageUser   = root
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)

For a job running on 2 nodes, 1 CPU per node, sacct shows:
[jecoulte at headnode ~]$ sudo sacct -j 386 --format JobID,JobName,AllocNodes,TotalCPU,CPUTime,NCPUS,CPUTimeRaw,AllocCPUs
       JobID    JobName AllocNodes   TotalCPU    CPUTime      NCPUS CPUTimeRAW  AllocCPUS
------------ ---------- ---------- ---------- ---------- ---------- ---------- ----------
386          fact_job.+          2  00:49.345   00:00:00          0          0          0
386.0          hostname          2  00:00.006   00:00:00          0          0          0
386.1        fact-sum.g          2  00:49.338   00:00:00          0          0          0

For the same job, the record in AccountingStorageLoc is:
[jecoulte at headnode ~]$ grep ^386 /var/log/slurmacct.log
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006538 1000 1000 - - 1 0 1 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0 0.00 hostname compute-[0-1] 0 0 0 0 (null) 4294967295
386 low 1517006536 1517006538 1000 1000 - - 1 0 3 0 2 2 0 0 6466 0 5388 0 1078 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269148 1 236380.00 620 1 618.00 0 1 0.00 0 1 0.00 hostname compute-[0-1] 1 1 1 1 (null) 4294967295
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006538 1000 1000 - - 1 1 1 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0 0.00 fact-sum.g compute-[0-1] 0 0 0 0 (null) 4294967295
386 low 1517006536 1517006565 1000 1000 - - 1 1 3 0 2 2 27 49 338902 48 94477 1 244425 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269148 1 236380.00 620 1 618.00 0 1 0.00 0 1 0.00 fact-sum.g compute-[0-1] 1 1 1 1 (null) 4294967295
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006537 1000 1000 - - 0 fact_job.job 1 4294901759 2 compute-[0-1] (null)
386 low 1517006536 1517006565 1000 1000 - - 3 28 3 4294967295 0
