[slurm-users] Excessive use of backfill on a cluster

Bjørn-Helge Mevik b.h.mevik at usit.uio.no
Tue Nov 20 06:51:44 MST 2018


It might be unrelated, but I remember we had similar problems when
setting up a new cluster two years ago.  I don't remember the details,
but I believe it was related to QOSes overriding partition limits.
Jobs in these QOSes (with requests that fell outside a partition limit,
such as the minimum number of nodes) were started fine by the
backfiller, but not by the scheduler.  It turned out that the check for
this was OK in the backfiller, but had a bug in the scheduler.  The
scheduler bug was fixed for that particular case, but you might be
hitting something similar.
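
To illustrate the kind of bug described above, here is a toy sketch in
Python.  It is not Slurm's code: every name in it (Partition, Job,
qos_overrides_partition, backfill_can_start, scheduler_can_start) is
made up for the example.  It only shows how two duplicated limit checks
can drift apart, so that a job is accepted by one code path and
rejected by the other.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    min_nodes: int
    max_nodes: int


@dataclass
class Job:
    nodes: int
    # Hypothetical flag: the job's QOS is allowed to override the
    # partition's node limits.
    qos_overrides_partition: bool = False


def within_partition_limits(job: Job, part: Partition) -> bool:
    """True if the request fits the partition's node limits."""
    return part.min_nodes <= job.nodes <= part.max_nodes


def backfill_can_start(job: Job, part: Partition) -> bool:
    # This copy of the check honors the QOS override.
    return job.qos_overrides_partition or within_partition_limits(job, part)


def scheduler_can_start(job: Job, part: Partition) -> bool:
    # This duplicated copy forgets the QOS override (the assumed bug).
    return within_partition_limits(job, part)


if __name__ == "__main__":
    part = Partition(min_nodes=4, max_nodes=16)
    job = Job(nodes=2, qos_overrides_partition=True)
    print("backfill: ", backfill_can_start(job, part))
    print("scheduler:", scheduler_can_start(job, part))
```

With such a mismatch, the affected jobs would only ever start via the
backfill path, which from the outside looks like excessive use of
backfill.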

(This was a while ago, and I do remember someone at SchedMD mentioning
that they were going to "de-duplicate" the scheduler and backfiller code
in the future, but I don't know how far they've gotten with it.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo