[slurm-users] Areas for improvement on our site's cluster scheduling

John Hearns hearnsj at googlemail.com
Tue May 8 01:49:01 MDT 2018


 > Eventually the job aging makes the jobs so high-priority,

Guess I should look in the manual, but could you increase the job ageing
time parameters?
I guess it is also worth saying that this is the scheduler doing its job -
it is supposed to keep jobs ready and waiting to go, to keep the cluster
busy!
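
Off the top of my head - and assuming you are using the multifactor priority
plugin - the job ageing knobs in slurm.conf would be something like the lines
below (the values are only placeholders, not recommendations):

    PriorityType=priority/multifactor
    # Weight of the age factor relative to the other priority factors
    PriorityWeightAge=1000
    # Waiting time after which a job's age factor stops growing
    PriorityMaxAge=14-0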

I was going to suggest a cron job that looks at the jobs the 'queue stuffer'
has queued and moves some of them down in priority.
That is a bad suggestion, though - in general, writing a 'scheduler within a
scheduler' is not a good idea, and you only end up fighting the real
scheduler.

I had a similar situation in my last job - a user needed to get some
work done and submitted a huge number of jobs.
It happened that there was low load on the cluster at the time, so
this user got a lot of jobs started. We finally had to temporarily limit the
maximum number of jobs he could submit. Again, if you think about it, this is
a good thing - we are operating batch queuing systems, and this user
was putting them to good use.
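
If you run the slurmdbd accounting database, that per-user cap can be set
with sacctmgr, roughly as below ('someuser' and the numbers are placeholders
for illustration only):

    # Limit how many jobs the user may have queued, and how many may run at once
    sacctmgr modify user where name=someuser set MaxSubmitJobs=200 MaxJobs=50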

The 'problem' is more related to the length of the jobs. If the 'queue
stuffer' is submitting jobs with a long wallclock time, then yes, you will
get complaints from the other users. With shorter jobs there is more
opportunity for other users to 'get a look in', as we say in Glasgow.
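
One way to nudge people towards shorter jobs, if that suits your site, is to
have partitions with different time limits - something along these lines in
slurm.conf (node lists and limits below are just placeholders):

    PartitionName=short  Nodes=node[001-064]  MaxTime=04:00:00    Default=YES  State=UP
    PartitionName=long   Nodes=node[001-064]  MaxTime=7-00:00:00  State=UP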


Actually, what IS bad is users not putting cluster resources to good use.
You can often see jobs which are 'stalled' - i.e. the nodes are reserved for
the job, but the internal logic of the job has failed and the executables
have not launched. Or maybe some user is running an interactive job and has
wandered off for coffee/beer/an extended holiday. It is well worth scanning
for stalled jobs and terminating them.
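
A rough sketch of such a scan is below. It assumes job accounting
(jobacct_gather) is enabled so that sstat can report per-job CPU usage; the
zero-CPU heuristic and whether you actually scancel anything are site policy,
so treat this as a starting point rather than a finished script:

    #!/bin/bash
    # Flag running jobs that have accumulated no CPU time at all.
    for jobid in $(squeue -h -t RUNNING -o %A); do
        elapsed=$(squeue -h -j "$jobid" -o %M)
        avecpu=$(sstat -n -P -a -j "$jobid" --format=AveCPU 2>/dev/null | head -1)
        # A job that has been running for a while with AveCPU still at zero
        # deserves a closer look.
        if [ "$avecpu" = "00:00:00" ]; then
            echo "Job $jobid: running for $elapsed with no CPU time - possibly stalled"
            # scancel "$jobid"   # only once you trust the heuristic
        fi
    done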

On 8 May 2018 at 09:25, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:

> On 05/08/2018 08:44 AM, Bjørn-Helge Mevik wrote:
>
>> Jonathon A Anderson <jonathon.anderson at colorado.edu> writes:
>>
>>> ## Queue stuffing
>>
>> There is the bf_max_job_user SchedulerParameter, which is sort of the
>> "poor man's MAXIJOB"; it limits the number of jobs from each user the
>> backfiller will try to start on each run.  It doesn't do exactly what
>> you want, but at least the backfiller will not create reservations for
>> _all_ the queue stuffer's jobs.
>>
>
> Adding to this I discuss backfilling configuration in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_scheduler#scheduler-configuration
>
> The MaxJobCount limit etc. is described in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#maxjobcount-limit
>
> /Ole
>
>
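
For reference, the two limits mentioned above end up in slurm.conf roughly as
below; the numbers are only examples, see Ole's pages for sensible values:

    SchedulerType=sched/backfill
    # Number of jobs per user the backfill scheduler will consider in one cycle
    SchedulerParameters=bf_max_job_user=20
    # Total number of jobs slurmctld will keep in its active job table
    MaxJobCount=50000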