[slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?
Paul Edmon
pedmon at cfa.harvard.edu
Mon May 8 13:54:00 UTC 2023
I would recommend standing up an instance of XDMod as it handles most of
this for you in its summary reports.
https://open.xdmod.org/10.0/index.html
-Paul Edmon-
On 5/3/23 2:05 PM, Joseph Francisco Guzman wrote:
> Good morning,
>
> We have at least one billed account right now, whose associated
> researchers are able to submit jobs that run against our normal
> fairshare queue, but not for an academic research purpose. So we'd
> like to calculate their CPU hours accurately. We are currently using a
> script that queries the database with sacct and sums up ElapsedRaw *
> AllocCPUS for all jobs. But this seems limited, because requeueing
> creates what the sacct man page calls duplicates. By default, jobs are
> requeued only for reasons outside the user's control, such as a
> NODE_FAIL, or when requeued manually with an scontrol command. Users
> can also requeue jobs themselves, though that's not a feature we've
> seen our researchers use.
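>
> For example (the job id here is hypothetical, just to illustrate the
> flag), the second query also shows the earlier requeued runs that the
> first one hides:
>
> sacct -j 12345 -o JobID,State,End,ElapsedRaw,AllocCPUS -X -P
> sacct -j 12345 -o JobID,State,End,ElapsedRaw,AllocCPUS -X -P -D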
>
> However, with the new scrontab feature, whenever a cron entry is
> executed more than once, sacct reports the previous runs as "requeued"
> and they are only visible by looking up duplicates. I haven't seen any
> billed account use requeueing or scrontab yet, but it's clear to me
> that this could become significant once researchers start using
> scrontab more. Scrontab has existed since one of the 2020 releases, I
> believe (Slurm 20.11), but we enabled it this year and see it as much
> more powerful than the traditional Linux crontab.
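>
> For anyone who hasn't used it, an entry looks roughly like this; the
> #SCRON line sets sbatch-style options for the recurring job (the
> options and path are illustrative):
>
> #SCRON --partition=normal --time=00:10:00
> */15 * * * * /path/to/check.sh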
>
> What would be the best way to calculate ElapsedRaw * AllocCPUS more
> thoroughly, accounting for duplicates while optionally ignoring
> unintentional requeueing such as from a NODE_FAIL?
>
> Here's the main loop of the simple bash script I have now:
>
> declare -A core_seconds
>
> while IFS='|' read -r end elapsed cpus; do
>     # if a job crosses the month boundary,
>     # the entire bill is put under the 2nd month
>     year_month="${end:0:7}"
>     # skip records whose fields aren't plain integers
>     if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then
>         continue
>     fi
>     core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
> done < <(sacct -a -A "$SLURM_ACCOUNT" \
>               -S "$START_DATE" \
>               -E "$END_DATE" \
>               -o End,ElapsedRaw,AllocCPUS -X -P --noheader)
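>
> To report the totals, a loop like this (not part of the script above,
> shown just to make the array's use concrete) prints per-month core
> hours:
>
> for ym in "${!core_seconds[@]}"; do
>     echo "$ym: $(( core_seconds[$ym] / 3600 )) core-hours"
> done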
>
> Our slurmdbd is configured to keep 6 months of data.
>
> It makes sense to loop through the job ids instead, using sacct's
> -D/--duplicates option each time to reveal the hidden duplicates in
> the REQUEUED state, but I'm interested in whether there are
> alternatives or whether I'm missing anything here; a rough sketch of
> the -D approach follows.
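>
> A minimal sketch of that idea (untested): since -D makes sacct show
> every record, including the requeued ones, a single query may be
> enough, and the State column can be used to skip requeues that were
> outside the user's control. The NODE_FAIL check is an assumption
> about which states shouldn't be billed:
>
> while IFS='|' read -r end elapsed cpus state; do
>     # assumption: don't bill runs that ended in NODE_FAIL
>     [[ "$state" == NODE_FAIL* ]] && continue
>     year_month="${end:0:7}"
>     [[ "$elapsed" =~ ^[0-9]+$ && "$cpus" =~ ^[0-9]+$ ]] || continue
>     core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
> done < <(sacct -a -A "$SLURM_ACCOUNT" -S "$START_DATE" -E "$END_DATE" \
>               -o End,ElapsedRaw,AllocCPUS,State -X -P --noheader -D)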
>
> Thanks,
>
> Joseph
>
> --------------------------------------------------------------
> Joseph F. Guzman - ITS (Advanced Research Computing)
>
> Northern Arizona University
>
> Joseph.F.Guzman at nau.edu
>