[slurm-users] Larger jobs tend to get starved out on our cluster

Christopher Benjamin Coffey Chris.Coffey at nau.edu
Thu Jan 10 09:52:42 MST 2019


Hi D.J.,

I noticed you have:

PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE

I'm pretty sure it does not make sense to have DEPTH_OBLIVIOUS and FAIR_TREE set at the same time. You'll want to choose one of them. That's not going to be the reason for this issue, however, but you are likely not running the fairshare algorithm you intended.
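
For example, if Fair Tree is the algorithm you want, one possibility (just a sketch; keep or drop SMALL_RELATIVE_TO_TIME as you see fit) would be:

PriorityFlags=FAIR_TREE,SMALL_RELATIVE_TO_TIME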


"My colleague is from a Moab background, and in that respect he was
    > surprised not to see nodes being reserved for jobs, but it could be
    > that Slurm works in a different way to try to make efficient use of
    > the cluster by backfilling more aggressively than Moab."

Slurm unfortunately does not indicate when nodes are being put aside for large jobs. I wish that it did. Nodes will instead be in "idle" state when prepping for a large job.
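
One thing you can do in the meantime is ask Slurm what the backfill scheduler has planned for a pending job, e.g.:

squeue --start -j <jobid>

or "scontrol show job <jobid>", which reports the expected StartTime (and, depending on your Slurm version, a SchedNodeList) once the job has been planned.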

To increase the chance that whole nodes are free for large MPI jobs, so that they can start sooner, you might consider the following parameters:

SelectTypeParameters=CR_Pack_Nodes

And 

SchedulerParameters=pack_serial_at_end,bf_busy_nodes
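
Note that CR_Pack_Nodes goes alongside the CR_Core you already have, so the select settings from the config you posted would look something like this (a sketch; adjust as needed):

SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_Pack_Nodes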

Also, as Loris pointed out, bf_window will need to be set to the max wall time in minutes.
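
For example, if your longest allowed wall time were 10 days (a made-up number for illustration), that would be 10 x 24 x 60 = 14400 minutes, and the scheduler line from the config you posted would become roughly:

SchedulerParameters=bf_window=14400,bf_resolution=180,bf_max_job_user=4,pack_serial_at_end,bf_busy_nodes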

Best,
Chris 

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/9/19, 11:52 PM, Loris Bennett <loris.bennett at fu-berlin.de> wrote:

    Hi David,
    
    If your maximum run-time is more than the 2 1/2 days (3600 minutes) you
    have set for bf_window, you might need to increase bf_window
    accordingly.   See the description here:
    
    https://slurm.schedmd.com/sched_config.html
    
    Cheers,
    
    Loris
    
    Baker D.J. <D.J.Baker at soton.ac.uk> writes:
    
    > Hello,
    >
    > A colleague intimated that he thought that larger jobs were tending to
    > get starved out on our slurm cluster. It's not a busy time at the
    > moment so it's difficult to test this properly. Back in November it
    > was not completely unusual for a larger job to have to wait up to a
    > week to start.
    >
    > I've extracted the key scheduling configuration out of the slurm.conf
    > and I would appreciate your comments, please. Even at the busiest of
    > times we notice many single compute jobs executing on the cluster --
    > starting either via the scheduler or by backfill.
    >
    > Looking at the scheduling configuration do you think that I'm
    > favouring small jobs too much? That is, for example, should I increase
    > the PriorityWeightJobSize to encourage larger jobs to run?
    >
    > I was very keen not to starve out small/medium jobs, however perhaps
    > there is too much emphasis on small/medium jobs in our setup.
    >
    > My colleague is from a Moab background, and in that respect he was
    > surprised not to see nodes being reserved for jobs, but it could be
    > that Slurm works in a different way to try to make efficient use of
    > the cluster by backfilling more aggressively than Moab. Certainly we
    > see a great deal of activity from backfill.
    >
    > In this respect does anyone understand the mechanism used to reserve
    > nodes/resources for jobs in slurm or potentially where to look for
    > that type of information.
    >
    > Best regards,
    > David
    >
    > SchedulerType=sched/backfill
    > SchedulerParameters=bf_window=3600,bf_resolution=180,bf_max_job_user=4
    >
    > SelectType=select/cons_res
    > SelectTypeParameters=CR_Core
    > FastSchedule=1
    > PriorityFavorSmall=NO
    > PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
    > PriorityType=priority/multifactor
    > PriorityDecayHalfLife=14-0
    >
    > PriorityWeightFairshare=1000000
    > PriorityWeightAge=100000
    > PriorityWeightPartition=0
    > PriorityWeightJobSize=100000
    > PriorityWeightQOS=10000
    > PriorityMaxAge=7-0
    >
    >
    -- 
    Dr. Loris Bennett (Mr.)
    ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
    
    


