[slurm-users] Need to restart slurmctld for gres jobs to start

Thu Jun 2 18:02:17 UTC 2022

Hello,

I have recently started to have problems where jobs sit in the queue waiting for resources to become available, even when the resources are available. If I stop and restart slurmctld, the pending jobs start running.

This seems to be related to GRES jobs. I have seven nodes with

Gres=bandwidth:ib:no_consume:1G

four nodes with

Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G

and one node with

Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G

Jobs only sit in the queue with RESOURCES as the REASON when we include the flag --gres=bandwidth:ib. If we remove the flag, the jobs run fine. But we need the flag to ensure that a job doesn't get a mix of IB and ethernet nodes, because jobs fail when that happens.

It seems that once a node completes a job with --gres=bandwidth:ib it won't run another job with this setting until I restart slurmctld.
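
To check which pending jobs are affected, something like the following rough sketch could be used. It assumes squeue's %i, %r and %b format fields (job ID, reason, requested GRES) and that the GRES column contains the string "bandwidth"; the exact text printed for %b may differ between Slurm versions, and the function names are made up for illustration.

```python
import subprocess

# Assumed squeue invocation: pending jobs only, no header,
# pipe-separated job ID, pending reason and requested GRES.
SQUEUE_CMD = ["squeue", "--noheader", "--states=PENDING", "--format=%i|%r|%b"]


def stuck_gres_jobs(squeue_output, gres_name="bandwidth"):
    """Return IDs of jobs pending on Resources that request the given GRES."""
    job_ids = []
    for line in squeue_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        job_id, reason, gres = fields
        if reason.lower() == "resources" and gres_name in gres:
            job_ids.append(job_id)
    return job_ids


def query_stuck_jobs():
    """Run squeue (hypothetical helper) and filter its output."""
    result = subprocess.run(SQUEUE_CMD, capture_output=True, text=True, check=True)
    return stuck_gres_jobs(result.stdout)
```

Calling query_stuck_jobs() before and after a slurmctld restart should show whether the same job IDs leave the pending list.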

The only error I can find is in /var/log/slurm/slurmctld.log

[2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc StepId=140569.0 dealloc, node_in_use is NULL
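
A minimal sketch for pulling every occurrence of this error out of the log, so the step IDs can be compared with the jobs that ran just before a node stopped accepting work. The regular expression is inferred from the single line above and may not match other variants of the message; the function names are hypothetical.

```python
import re

# Pattern inferred from the one error line quoted above (an assumption).
DEALLOC_ERROR = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] error: gres/(?P<gres>\w+): _step_dealloc "
    r"StepId=(?P<step_id>\S+) dealloc, node_in_use is NULL"
)


def find_dealloc_errors(lines):
    """Return (timestamp, GRES name, step ID) for each matching log line."""
    hits = []
    for line in lines:
        match = DEALLOC_ERROR.match(line.strip())
        if match:
            hits.append((match["timestamp"], match["gres"], match["step_id"]))
    return hits


def scan_log(path="/var/log/slurm/slurmctld.log"):
    """Scan the slurmctld log at the path mentioned above."""
    with open(path) as log:
        return find_dealloc_errors(log)
```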

These jobs were running consistently but then started giving us trouble about a month ago. I have tried restarting slurmd on all nodes and restarting slurmctld. Restarting slurmctld does provide a temporary fix.

I'm using Slurm 21.08.3 and Rocky Linux release 8.5.

Do you have any suggestions as to what is wrong or how to fix it?

Thank you,

Tyler
