[slurm-users] Overzealous PartitionQoS Limits
Christoph Brüning
christoph.bruening at uni-wuerzburg.de
Wed May 20 10:00:31 UTC 2020
Dear all,
we set up a floating partition as described in SLURM's QoS documentation
to allow for jobs with a longer-than-usual walltime on part of our
cluster: a QoS with GrpCPUs and GrpNodes limits attached to the
longer-walltime partition, which contains all nodes.
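For reference, a minimal sketch of what such a setup looks like (the QoS name
and limit values below are placeholders, not our actual numbers; on 17.11 the
CPU and node caps are expressed via GrpTRES):

$ sacctmgr add qos long_limits
$ sacctmgr modify qos where name=long_limits set GrpTRES=cpu=256,node=16

and in slurm.conf:

PartitionName=long Nodes=ALL MaxTime=14-00:00:00 QOS=long_limits State=UP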
We observe jobs stuck in the queue like this:
$ squeue -o "%.7i %.9P %.2t %.6C %.20S %R"
JOBID PARTITION ST CPUS START_TIME NODELIST(REASON)
1108810 long PD 2 N/A (QOSGrpNodeLimit)
1108811 long PD 2 N/A (QOSGrpNodeLimit)
1108812 long PD 2 N/A (QOSGrpNodeLimit)
1108813 long PD 2 N/A (QOSGrpNodeLimit)
1108814 long PD 2 N/A (QOSGrpNodeLimit)
1108815 long PD 2 N/A (QOSGrpNodeLimit)
1108816 long PD 2 N/A (QOSGrpNodeLimit)
1108817 long PD 2 N/A (QOSGrpNodeLimit)
1108818 long PD 2 N/A (QOSGrpNodeLimit)
[...]
However, we are nowhere near any of the GrpNodes or GrpCPUs limits, and
there are nodes in MIXED state that should have slots available for
two-CPU jobs.
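In case it is useful for comparison, the configured limits and the usage
counted against them can be checked along these lines (output details may
differ slightly on 17.11):

$ sacctmgr show qos format=Name,GrpTRES%30
$ scontrol show assoc_mgr flags=QOS | grep -i grptres

plus a quick sum over the running jobs in the partition:

$ squeue -h -t R -p long -o "%C" | awk '{c+=$1} END {print c" CPUs in use"}'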
The jobs in question even have the highest priority in the queue (except
for two jobs on a special-hardware partition), and their "Dependency="
field is empty.
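For completeness, one way to double-check priority and dependencies (the job
ID is just one of the examples above; %Q is the job priority, %E the
remaining dependency list):

$ squeue -p long -t PD -o "%.8i %.10Q %E" | sort -k2,2nr | head
$ scontrol show job 1108810 | grep -o "Dependency=[^ ]*"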
It seems that those jobs are occasionally assigned a start time when the
scheduler runs, but that is quickly reverted to "N/A".
Did any of you observe this or similar behaviour?
FWIW, we are running SLURM 17.11 on Debian; an upgrade to 19.05 is
scheduled for the next couple of weeks.
Best,
Christoph
--
Dr. Christoph Brüning
Universität Würzburg
Rechenzentrum
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499