[slurm-users] Need to restart slurmctld for gres jobs to start

tluchko tluchko at protonmail.com
Thu Jun 2 18:02:17 UTC 2022


I have recently started seeing jobs sit in the queue waiting for resources even though the resources are available. If I stop and restart slurmctld, the pending jobs start running.

This seems to be related to GRES jobs. I have seven nodes with


four nodes with


and one node with.


Jobs sit in the queue with RESOURCES as the REASON only when we include the flag --gres=bandwidth:ib. If we remove the flag, the jobs run fine. However, we need the flag to ensure that a job does not get a mix of IB and ethernet nodes, because jobs fail in that case.

It seems that once a node completes a job with --gres=bandwidth:ib, it won't run another job with this setting until I restart slurmctld.

The only error I can find is in /var/log/slurm/slurmctld.log:

[2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc StepId=140569.0 dealloc, node_in_use is NULL
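
To see whether this error lines up with the times jobs start getting stuck, the log could be scanned for it. The following is only a rough sketch: the regular expression is inferred from the single line quoted above, and the function names find_dealloc_errors and errors_per_day are made up for illustration, so other Slurm versions or other messages may need a different pattern.

```python
import re
from collections import Counter

# Pattern inferred from the one error line quoted in this post (an assumption;
# other Slurm versions may word the message differently).
DEALLOC_ERROR = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] error: gres/(?P<gres>\S+): _step_dealloc "
    r"StepId=(?P<step_id>\S+) dealloc, node_in_use is NULL"
)


def find_dealloc_errors(lines):
    """Return (timestamp, gres name, step id) for each matching log line."""
    hits = []
    for line in lines:
        match = DEALLOC_ERROR.match(line.strip())
        if match:
            hits.append((match["timestamp"], match["gres"], match["step_id"]))
    return hits


def errors_per_day(hits):
    """Count matching errors per day, using the date part of the timestamp."""
    return Counter(timestamp.split("T")[0] for timestamp, _, _ in hits)


if __name__ == "__main__":
    # The sample is the line from this post; to scan the real log, pass
    # open("/var/log/slurm/slurmctld.log") instead of this list.
    sample = [
        "[2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc "
        "StepId=140569.0 dealloc, node_in_use is NULL",
    ]
    for timestamp, gres, step_id in find_dealloc_errors(sample):
        print(timestamp, gres, step_id)
```

Comparing the step IDs and dates it reports against the jobs that last ran on a stuck node would show whether the error and the stuck state are actually connected, which I have not confirmed.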

These jobs were running consistently but started giving us trouble about a month ago. I have tried restarting slurmd on all nodes and restarting slurmctld. Restarting slurmctld does provide a temporary fix.

I'm using Slurm 21.08.3 and Rocky Linux release 8.5.

Do you have any suggestions as to what is wrong or how to fix it?

Thank you,

