[slurm-users] Overzealous PartitionQoS Limits
Christoph Brüning
christoph.bruening at uni-wuerzburg.de
Wed May 20 11:38:02 UTC 2020

Quick update:
When we increase the GrpNodes limit, some of the jobs start running.
However, they run on nodes that already have jobs from the "long" 
partition running.
To my understanding, that should not change the node count against 
which the GrpNodes limit is applied...
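
A quick cross-check of how many distinct nodes currently run "long" 
jobs (a sketch; it assumes a GNU userland and that jobs in the "long" 
partition are the only ones charged against this QoS):

    squeue -h -p long -t R -o "%N" | paste -sd, - \
        | xargs scontrol show hostnames | sort -u | wc -l
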
Best,
Christoph
On 20/05/2020 12.00, Christoph Brüning wrote:
> Dear all,
> 
> we set up a floating partition as described in SLURM's QoS documentation 
> to allow for jobs with a longer-than-usual walltime on part of our 
> cluster: a QoS with GrpCPUs and GrpNodes limits attached to the 
> longer-walltime partition, which contains all nodes.
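> 
> For reference, a minimal sketch of such a setup (the QoS name and the 
> limit values are placeholders, not our actual configuration):
> 
>     # QoS group limits; GrpTRES=cpu=...,node=... is the TRES form
>     # of the GrpCPUs/GrpNodes limits
>     sacctmgr add qos long
>     sacctmgr modify qos long set GrpTRES=cpu=128,node=8
> 
>     # slurm.conf: floating partition spanning all nodes, tied to the QoS
>     PartitionName=long Nodes=ALL MaxTime=7-00:00:00 QOS=long State=UP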
> 
> We observe that jobs are stuck in the queue like:
> 
> $ squeue -o "%.7i %.9P %.2t %.6C %.20S %R"
>    JOBID PARTITION ST   CPUS           START_TIME NODELIST(REASON)
> 1108810      long PD      2                  N/A (QOSGrpNodeLimit)
> 1108811      long PD      2                  N/A (QOSGrpNodeLimit)
> 1108812      long PD      2                  N/A (QOSGrpNodeLimit)
> 1108813      long PD      2                  N/A (QOSGrpNodeLimit)
> 1108814      long PD      2                  N/A (QOSGrpNodeLimit)
> 1108815      long PD      2                  N/A (QOSGrpNodeLimit)
> 1108816      long PD      2                  N/A (QOSGrpNodeLimit)
> 1108817      long PD      2                  N/A (QOSGrpNodeLimit)
> 1108818      long PD      2                  N/A (QOSGrpNodeLimit)
> [...]
> 
> However, we are not even close to any of the GrpNodes or GrpCPUs limits.
> Moreover, there are nodes in MIXED state that should still have slots 
> available for two-CPU jobs.
> The jobs in question even have the highest priority (except for two 
> jobs on a special-hardware partition), and their "Dependency=" field 
> is empty.
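> 
> The scheduler's own view of the usage counted against the QoS can be 
> inspected with something like this (the QoS name is again a 
> placeholder):
> 
>     sacctmgr show qos long format=Name,GrpTRES
>     scontrol show assoc_mgr flags=qos
> 
> The assoc_mgr output shows the in-memory usage next to each limit, 
> which is what QOSGrpNodeLimit is actually evaluated against.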
> 
> It seems that those jobs are occasionally assigned a start time when the 
> scheduler runs, but that is quickly reverted to "N/A".
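> 
> One way to watch these decisions in slurmctld.log is to enable 
> backfill debugging for a while (a sketch, not something we have tried 
> yet):
> 
>     scontrol setdebugflags +backfill
>     # ...inspect the log, then switch it off again:
>     scontrol setdebugflags -backfill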
> 
> Did any of you observe this or similar behaviour?
> FWIW, we are running SLURM 17.11 on Debian; an upgrade to 19.05 is 
> scheduled for the next couple of weeks.
> 
> Best,
> Christoph
> 
> 
-- 
Dr. Christoph Brüning
Universität Würzburg
Rechenzentrum
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499
    
    