[slurm-users] GPU jobs not running correctly

Fulcomer, Samuel samuel_fulcomer at brown.edu
Fri Aug 20 01:05:28 UTC 2021


What SLURM version are you running?

What are the #SBATCH directives in the batch script? (or the sbatch
arguments)

When the single GPU jobs are pending, what's the output of 'scontrol show
job JOBID'?
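
For a pending job, the fields worth checking are Reason=, TresPerNode, and any Gres lines. A sketch of the kind of command meant here (the job ID is a placeholder, not a real job on your cluster):

```
# 12345 is a hypothetical job ID; Reason= explains why the job is pending
scontrol show job 12345 | grep -E 'Reason|TresPerNode|Gres'
```

If Reason= says something like "Resources" or names a specific node, that narrows down what the scheduler thinks is available.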

What are the node definitions in slurm.conf, and the lines in gres.conf?
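
For comparison, a minimal sketch of what matching slurm.conf and gres.conf entries for 4-GPU nodes might look like (node names, CPU and memory counts, and device paths below are placeholders, not your actual values):

```
# slurm.conf -- hypothetical node definitions
NodeName=gpu01      Gres=gpu:rtx3090:4 CPUs=32 RealMemory=256000 State=UNKNOWN
NodeName=gpu[02-03] Gres=gpu:rtx6000:4 CPUs=32 RealMemory=256000 State=UNKNOWN

# gres.conf -- hypothetical matching GRES lines
NodeName=gpu01      Name=gpu Type=rtx3090 File=/dev/nvidia[0-3]
NodeName=gpu[02-03] Name=gpu Type=rtx6000 File=/dev/nvidia[0-3]
```

A Type= mismatch between the two files, or jobs requesting a typed GRES (e.g. --gres=gpu:rtx3090:1 rather than --gres=gpu:1), is one common way single-GPU jobs end up pinned to a single node.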

Are the nodes all the same host platform (motherboard)?

We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX 1s,
A6000s, and A40s, with a mix of single and dual-root platforms, and haven't
seen this problem with SLURM 20.02.6 or earlier versions.

On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin <malyutinag at gmail.com>
wrote:

> Hello,
>
> We are in the process of finishing up the setup of a cluster with 3 nodes,
> 4 GPUs each. One node has RTX 3090s and the other two have RTX 6000s. Any job
> asking for 1 GPU in the submission script will wait to run on the 3090
> node, regardless of resource availability. The same job requesting 2 or more GPUs
> will run on any node. I don't even know where to begin troubleshooting this
> issue; entries for the 3 nodes are effectively identical in slurm.conf. Any
> help would be appreciated. (If helpful - this cluster is used for
> structural biology, with cryosparc and relion packages).
>
> Thank you,
> Andrey
>
