[slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?

Joseph Francisco Guzman Joseph.F.Guzman at nau.edu
Wed May 3 18:05:28 UTC 2023


Good morning,

We have at least one billed account right now, where the associated researchers are able to submit jobs that run against our normal queue with fairshare, but not for an academic research purpose. So we'd like to accurately calculate their CPU hours. We are currently using a script that queries the database with sacct and sums ElapsedRaw * AllocCPUS over all jobs. But this seems limited, because requeueing creates what the sacct man page calls duplicates. By default, jobs are only requeued when something outside the user's control happens, like a NODE_FAIL, or when an scontrol command requeues them manually. Users can also requeue jobs themselves, though that's not a feature we've seen our researchers use.

However, with the new scrontab feature, whenever the cron entry executes more than once, sacct reports the previous runs as "requeued", and they are only visible by looking up duplicates. I haven't seen any billed account use requeueing or scrontab yet, but it's clear to me that this could become significant once researchers start using scrontab more. Scrontab has existed since Slurm 20.11 in late 2020, I believe, but we only enabled it this year, and we see it as much more powerful than the traditional Linux crontab.
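For example, for a single scrontab job that has run several times (hypothetical job ID), the difference is visible by comparing the default view with the --duplicates view:

# Default view: only the most recent run of the job is shown
sacct -j 123456 -X -P -o JobID,State,ElapsedRaw,AllocCPUS

# With -D/--duplicates, the earlier runs marked REQUEUED become visible,
# each carrying its own ElapsedRaw and AllocCPUS
sacct -j 123456 -X -P -D -o JobID,State,ElapsedRaw,AllocCPUS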

What would be the best way to calculate ElapsedRaw * AllocCPUS more thoroughly, accounting for duplicates while optionally ignoring unintentional requeues such as those from a NODE_FAIL?

Here's the main loop of the simple bash script I have now:

declare -A core_seconds  # per-month core-second totals, keyed by YYYY-MM

while IFS='|' read -r end elapsed cpus; do
    # if a job crosses the month boundary,
    # the entire bill will be put under the 2nd month
    year_month="${end:0:7}"
    # skip rows where either field is empty or non-numeric
    # (e.g. jobs that haven't finished yet)
    if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then
        continue
    fi
    core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
done < <(sacct -a -A "$SLURM_ACCOUNT" \
               -S "$START_DATE" \
               -E "$END_DATE" \
               -o End,ElapsedRaw,AllocCPUS -X -P --noheader)
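A reporting step along these lines then follows, converting core-seconds to core-hours per month (a sketch, assuming the array above):

# Print per-month totals, converting core-seconds to core-hours
for year_month in "${!core_seconds[@]}"; do
    printf '%s: %d core-hours\n' "$year_month" "$(( core_seconds[$year_month] / 3600 ))"
done | sort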

Our slurmdbd is configured to keep 6 months of data.
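For reference, six months of retention corresponds to purge settings in slurmdbd.conf along these lines (a sketch; our exact purge values may differ):

# slurmdbd.conf (sketch): keep roughly 6 months of job/step records
PurgeJobAfter=6month
PurgeStepAfter=6month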

It makes sense to loop through the job IDs instead, using sacct's -D/--duplicates option each time to reveal the hidden duplicates in the REQUEUED state, but I'm interested in whether there are alternatives or if I'm missing anything here.
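Something like this untested sketch is what I'm picturing (the State filter is a guess at how to spot the NODE_FAIL requeues):

# Collect the job IDs, then re-query each one with -D so requeued
# runs are included in the per-month sums
while read -r jobid; do
    while IFS='|' read -r end elapsed cpus state; do
        # optionally skip runs that ended in NODE_FAIL, since that
        # requeueing wasn't the user's doing
        [[ "$state" == NODE_FAIL* ]] && continue
        [[ "$elapsed" =~ ^[0-9]+$ && "$cpus" =~ ^[0-9]+$ ]] || continue
        year_month="${end:0:7}"
        core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
    done < <(sacct -j "$jobid" -D -X -P --noheader \
                   -o End,ElapsedRaw,AllocCPUS,State)
done < <(sacct -a -A "$SLURM_ACCOUNT" -S "$START_DATE" -E "$END_DATE" \
               -X -P --noheader -o JobID)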

Thanks,

Joseph


--------------------------------------------------------------
Joseph F. Guzman - ITS (Advanced Research Computing)

Northern Arizona University

Joseph.F.Guzman at nau.edu