[slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?
Paul Edmon
pedmon at cfa.harvard.edu
Mon May 8 13:54:00 UTC 2023
I would recommend standing up an instance of XDMod as it handles most of
this for you in its summary reports.
https://open.xdmod.org/10.0/index.html
-Paul Edmon-
On 5/3/23 2:05 PM, Joseph Francisco Guzman wrote:
> Good morning,
>
> We have at least one billed account right now, whose associated
> researchers are able to submit jobs that run against our normal
> fairshare queue, but not for an academic research purpose. So we'd
> like to calculate their CPU hours accurately. We are currently using a
> script that queries the database with sacct and sums up ElapsedRaw *
> AllocCPUS for all jobs. But this seems limited, because requeueing
> creates what the sacct man page calls duplicates. By default, jobs are
> requeued only for reasons outside the user's control, such as a
> NODE_FAIL, or when requeued manually with an scontrol command. Users
> can also requeue jobs themselves, though that's not a feature we've
> seen our researchers use.
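>
> For example (the job id here is hypothetical, just to illustrate the
> flag), the second query also shows the earlier requeued runs that the
> first one hides:
>
> sacct -j 12345 -o JobID,State,End,ElapsedRaw,AllocCPUS -X -P
> sacct -j 12345 -o JobID,State,End,ElapsedRaw,AllocCPUS -X -P -D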
>
> However, with the new scrontab feature, whenever a cron entry is
> executed more than once, sacct reports the previous runs as "requeued"
> and they are only visible by looking up duplicates. I haven't seen any
> billed account use requeueing or scrontab yet, but it's clear to me
> that this could become significant once researchers start using
> scrontab more. Scrontab has existed since one of the 2020 releases, I
> believe (Slurm 20.11), but we enabled it this year and see it as much
> more powerful than the traditional Linux crontab.
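>
> For anyone who hasn't used it, an entry looks roughly like this; the
> #SCRON line sets sbatch-style options for the recurring job (the
> options and path are illustrative):
>
> #SCRON --partition=normal --time=00:10:00
> */15 * * * * /path/to/check.sh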
>
> What would be the best way to calculate ElapsedRaw * AllocCPUS more
> thoroughly, accounting for duplicates while optionally ignoring
> unintentional requeueing such as from a NODE_FAIL?
>
> Here's the main loop of the simple bash script I have now:
>
> declare -A core_seconds
>
> while IFS='|' read -r end elapsed cpus; do
>     # if a job crosses the month boundary,
>     # the entire bill is put under the 2nd month
>     year_month="${end:0:7}"
>     # skip records whose fields aren't plain integers
>     if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then
>         continue
>     fi
>     core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
> done < <(sacct -a -A "$SLURM_ACCOUNT" \
>               -S "$START_DATE" \
>               -E "$END_DATE" \
>               -o End,ElapsedRaw,AllocCPUS -X -P --noheader)
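>
> To report the totals, a loop like this (not part of the script above,
> shown just to make the array's use concrete) prints per-month core
> hours:
>
> for ym in "${!core_seconds[@]}"; do
>     echo "$ym: $(( core_seconds[$ym] / 3600 )) core-hours"
> done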
>
> Our slurmdbd is configured to keep 6 months of data.
>
> It makes sense to loop through the job ids instead, using sacct's
> -D/--duplicates option each time to reveal the hidden duplicates in
> the REQUEUED state, but I'm interested in whether there are
> alternatives or whether I'm missing anything here; a rough sketch of
> the -D approach follows.
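>
> A minimal sketch of that idea (untested): since -D makes sacct show
> every record, including the requeued ones, a single query may be
> enough, and the State column can be used to skip requeues that were
> outside the user's control. The NODE_FAIL check is an assumption
> about which states shouldn't be billed:
>
> while IFS='|' read -r end elapsed cpus state; do
>     # assumption: don't bill runs that ended in NODE_FAIL
>     [[ "$state" == NODE_FAIL* ]] && continue
>     year_month="${end:0:7}"
>     [[ "$elapsed" =~ ^[0-9]+$ && "$cpus" =~ ^[0-9]+$ ]] || continue
>     core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
> done < <(sacct -a -A "$SLURM_ACCOUNT" -S "$START_DATE" -E "$END_DATE" \
>               -o End,ElapsedRaw,AllocCPUS,State -X -P --noheader -D)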
>
> Thanks,
>
> Joseph
>
> --------------------------------------------------------------
> Joseph F. Guzman - ITS (Advanced Research Computing)
>
> Northern Arizona University
>
> Joseph.F.Guzman at nau.edu
>