[slurm-users] Need to restart slurmctld for gres jobs to start

Ryan Novosielski novosirj at rutgers.edu
Fri Jun 24 20:56:55 UTC 2022


On 6/2/22 14:02, tluchko wrote:
> Hello,
> 
> I have recently started to have problems where jobs sit in the queue 
> waiting for resources to become available, even when the resources are 
> available. If I stop and restart slurmctld, the pending jobs start running.
> 
> This seems to be related to GRES jobs.  I have seven nodes with
> 
> Gres=bandwidth:ib:no_consume:1G
> 
> four nodes with
> 
> Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G
> 
> and one node with
> 
> Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G
> 
> Jobs only sit in the queue with RESOURCES as the REASON when we include 
> the flag --gres=bandwidth:ib.  If we remove the flag, the jobs run fine. 
> But we need the flag to ensure that we don't get a mix of IB and 
> ethernet nodes because they fail in this case.
> 
> It seems that once a node completes a job with --gres=bandwidth:ib it 
> won't run another job with this setting until I restart slurmctld.
> 
> The only error I can find is in /var/log/slurm/slurmctld.log
> 
> [2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc 
> StepId=140569.0 dealloc, node_in_use is NULL
> 
> These jobs were running consistently but then started giving us trouble 
> about a month ago. I have tried restarting slurmd on all nodes and 
> slurmctld.  Restarting slurmctld does provide a temporary fix.
> 
> I'm using Slurm 21.08.3 and Rocky Linux release 8.5.
> 
> Do you have any suggestions as to what is wrong or how to fix it?
> 
> Thank you,
> 
> Tyler

An alternative way to deal with this is the topology plugin. We use 
it to keep jobs from spanning two different InfiniBand fabrics that 
are joined by a link with lower bandwidth than the rest of the fabric.
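
A minimal sketch of what that could look like for a cluster like yours. 
The node names (ib[01-07], gpu[01-05]) and switch names are hypothetical 
placeholders, and this is not our actual configuration:

```
# slurm.conf (fragment)
TopologyPlugin=topology/tree

# topology.conf
# Two leaf switches with no common parent switch.
# Switch and node names are hypothetical; substitute your own.
SwitchName=ib_sw  Nodes=ib[01-07]
SwitchName=eth_sw Nodes=gpu[01-05]
```

With no top-level switch joining the two leaves, my understanding is that 
the scheduler should not place a single job across both groups, but check 
the topology guide for your Slurm version before relying on that. Where 
the groups are joined by a parent switch, a job can also request 
--switches=1 to ask for nodes under a single leaf switch.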

-- 
#BlackLivesMatter
____
  || \\UTGERS,     |----------------------*O*------------------------
  ||_// the State  |    Ryan Novosielski - novosirj at rutgers.edu
  || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
  ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
       `'


