[slurm-users] Need to restart slurmctld for gres jobs to start
tluchko at protonmail.com
Thu Jun 2 18:02:17 UTC 2022
I have recently started to have problems where jobs sit in the queue waiting for resources to become available, even when the resources are available. If I stop and restart slurmctld, the pending jobs start running.
This seems to be related to GRES jobs. I have seven nodes with
four nodes with
and one node with.
Jobs only sit in the queue with RESOURCES as the REASON when we include the flag --gres=bandwidth:ib. If we remove the flag, the jobs run fine. But we need the flag to ensure that a job doesn't get a mix of IB and ethernet nodes, because jobs fail when that happens.
It seems that once a node completes a job with --gres=bandwidth:ib, it won't run another job with this setting until I restart slurmctld.
The only error I can find is in /var/log/slurm/slurmctld.log:
[2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc StepId=140569.0 dealloc, node_in_use is NULL
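To see how often this error shows up and which jobs it is tied to, something like the following could be run against the log. This is only a rough sketch: the regular expression is inferred from the single error line quoted above, and the function names (find_dealloc_errors, count_by_day, scan_log) are made up for illustration and are not part of Slurm.

```python
import re
from collections import Counter

# Pattern inferred from the one sample line quoted above (an assumption);
# other Slurm versions may format this message differently.
DEALLOC_ERROR = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] error: gres/(?P<gres>\S+): _step_dealloc "
    r"StepId=(?P<job_id>\d+)\.(?P<step>\S+) dealloc, node_in_use is NULL"
)


def find_dealloc_errors(lines):
    """Return (timestamp, gres name, job id) for each matching log line."""
    hits = []
    for line in lines:
        match = DEALLOC_ERROR.match(line.strip())
        if match:
            hits.append(
                (match["timestamp"], match["gres"], int(match["job_id"]))
            )
    return hits


def count_by_day(hits):
    """Count matching errors per calendar day (the date part of the timestamp)."""
    return Counter(timestamp.split("T")[0] for timestamp, _, _ in hits)


def scan_log(path):
    """Hypothetical helper: read a log file and return the matching errors."""
    with open(path) as log:
        return find_dealloc_errors(log)
```

Calling scan_log("/var/log/slurm/slurmctld.log") and passing the result to count_by_day would show whether the errors began around the time the jobs started getting stuck, and the job IDs can then be compared against the jobs that ran with --gres=bandwidth:ib.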
These jobs were running consistently but then started giving us trouble about a month ago. I have tried restarting slurmd on all nodes and slurmctld. Restarting slurmctld does provide a temporary fix.
I'm using Slurm 21.08.3 and Rocky Linux release 8.5.
Do you have any suggestions as to what is wrong or how to fix it?