[slurm-users] Slurm Scheduler Help

Fri Jun 11 18:13:49 UTC 2021

Not sure how well this would scale to 60k queued jobs, but we're using:

SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

in our setup. bf_window is set to our 30-day maximum job time (43200 minutes), bf_resolution is 5% of that figure, and the other values are simply what we settled on. This resolved some backfill issues we had in previous years.
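
For anyone adapting these numbers to a different maximum job time, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration and are not part of Slurm. One thing to keep in mind: as far as the slurm.conf man page describes it, bf_window is given in minutes and bf_resolution in seconds, so 2160 is 5% of the bf_window number but corresponds to 36 minutes of wall time.

```python
# Sketch only: derive the backfill values quoted above from a max job time.
# Assumption (from slurm.conf(5)): bf_window is in minutes, bf_resolution in seconds.

def backfill_settings(max_job_days, fraction=0.05):
    """Return (bf_window, bf_resolution) for a given maximum job time in days."""
    bf_window = max_job_days * 24 * 60           # minutes covered by the backfill window
    bf_resolution = round(bf_window * fraction)  # same 5%-of-the-number rule as in the post
    return bf_window, bf_resolution


def scheduler_parameters(max_job_days, fraction=0.05,
                         bf_max_job_user=80, default_queue_depth=200):
    """Format a SchedulerParameters line in the same order as the one above."""
    bf_window, bf_resolution = backfill_settings(max_job_days, fraction)
    parts = [
        f"bf_window={bf_window}",
        f"bf_resolution={bf_resolution}",
        f"bf_max_job_user={bf_max_job_user}",
        "bf_continue",
        f"default_queue_depth={default_queue_depth}",
    ]
    return "SchedulerParameters=" + ",".join(parts)


if __name__ == "__main__":
    # For a 30-day limit this reproduces the line quoted in this message.
    print(scheduler_parameters(30))
```

The defaults for bf_max_job_user and default_queue_depth are just the values from this message, not recommendations.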

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Dana, Jason T. <Jason.Dana at jhuapl.edu>
Date: Friday, June 11, 2021 at 12:27 PM
To: slurm-users at schedmd.com <slurm-users at schedmd.com>
Subject: [slurm-users] Slurm Scheduler Help

Hello,

I currently manage a small cluster separated into 4 partitions. I am experiencing unexpected behavior with the scheduler when the queue has been flooded with a large number of jobs (around 60000) by a single user to a single partition. We have each user bound to a global GrpTRES CPU limit. Once this user reaches their CPU limit, the jobs are queued with reason “AssocGroupCpuLimit”, but after a few hundred or so of the jobs the reason seems to switch to “Priority”.

The issue is that once this switch occurs, it appears to also impact all other partitions. Currently, any job submitted to any of the partitions, regardless of resources available, is queued by the scheduler with the reason “Priority”. We had the scheduler initially configured for backfill but have also tried switching to builtin, and it did not seem to make a difference. I tried increasing default_queue_depth to 100000 and that didn’t seem to help either. The scheduler log is also unhelpful, as it simply lists the accounting-limited jobs and never mentions the “Priority” queued jobs:

sched: [2021-06-11T13:21:53.993] JobId=495780 delayed for accounting policy
sched: [2021-06-11T13:21:53.997] JobId=495781 delayed for accounting policy
sched: [2021-06-11T13:21:54.001] JobId=495782 delayed for accounting policy
sched: [2021-06-11T13:21:54.005] JobId=495783 delayed for accounting policy
sched: [2021-06-11T13:21:54.005] loop taking too long, breaking out
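
To get a quick picture of a log like this, a small Python sketch along the following lines could count how many jobs each pass reports as delayed and how often the loop breaks out early. The line format is assumed purely from the excerpt above, and the function and pattern names are illustrative. If the loop consistently breaks out while still working through the accounting-limited jobs, that might explain why jobs further down the queue are never evaluated, though that is a guess rather than a confirmed diagnosis.

```python
import re

# Patterns are assumptions based only on the log excerpt in this message.
DELAYED = re.compile(
    r"^sched: \[(?P<ts>[^\]]+)\] JobId=(?P<job>\d+) delayed for accounting policy$"
)
LOOP_BREAK = re.compile(
    r"^sched: \[(?P<ts>[^\]]+)\] loop taking too long, breaking out$"
)

SAMPLE = """\
sched: [2021-06-11T13:21:53.993] JobId=495780 delayed for accounting policy
sched: [2021-06-11T13:21:53.997] JobId=495781 delayed for accounting policy
sched: [2021-06-11T13:21:54.001] JobId=495782 delayed for accounting policy
sched: [2021-06-11T13:21:54.005] JobId=495783 delayed for accounting policy
sched: [2021-06-11T13:21:54.005] loop taking too long, breaking out
"""


def summarize(lines):
    """Count accounting-delayed jobs and early loop exits in scheduler log lines."""
    delayed_ids = []
    loop_breaks = 0
    for raw in lines:
        line = raw.strip()
        match = DELAYED.match(line)
        if match:
            delayed_ids.append(int(match.group("job")))
        elif LOOP_BREAK.match(line):
            loop_breaks += 1
    return {
        "delayed_jobs": len(delayed_ids),
        "first_job": delayed_ids[0] if delayed_ids else None,
        "last_job": delayed_ids[-1] if delayed_ids else None,
        "loop_breaks": loop_breaks,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
```

In practice the input would be the lines of the real scheduler log file rather than the embedded sample.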

I’ve gone through all the documentation I’ve found on the scheduler and cannot seem to resolve this. I’m hoping I’m simply missing something.

Any help would be great. Thank you!

Jason
