[slurm-users] only 1 job running
Chandler
admin at genome.arizona.edu
Thu Jan 28 20:42:01 UTC 2021
Christopher Samuel wrote on 1/28/21 12:50:
> Did you restart the slurm daemons when you added the new node? Some internal data structures (bitmaps) are built based on the number of nodes and they need to be rebuilt with a restart in this situation.
>
> https://slurm.schedmd.com/faq.html#add_nodes
OK, this seems to have helped a little. After restarting the Slurm services, and killing and restarting the task as well, more jobs are running.
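For reference, the restarts amounted to something like the following (assuming systemd-managed Slurm services here; unit names and how you reach the compute nodes may differ on other installs):

    # on the head node, restart the controller so it rebuilds its node bitmaps
    systemctl restart slurmctld
    # on every compute node, restart the node daemon (via ssh/pdsh or similar)
    systemctl restart slurmd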
Still, two issues remain:
1. n[011-013] were in a drain state and had to be updated back to "idle" (see the commands below).
2. n[011-013] are now running jobs too, but only 8 each when they should be running 16; n010 is running 16 jobs now as expected.
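For the record, clearing the drain state was done with something along these lines (exact syntax typed from memory, so it may need adjusting):

    # check why the nodes were drained before clearing them
    sinfo -R
    # return the drained nodes to service (they go back to "idle" once nothing is running on them)
    scontrol update NodeName=n[011-013] State=RESUME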
The squeue output looks like:
JOBID PARTITION     NAME     USER ST TIME NODES NODELIST(REASON)
  247      defq cromwell smrtanal PD 0:00     1 (Resources)
  248      defq cromwell smrtanal PD 0:00     1 (Priority)
  ...
  278      defq cromwell smrtanal PD 0:00     1 (Priority)
  207      defq cromwell smrtanal  R 5:01     1 n010
  ...
  222      defq cromwell smrtanal  R 5:01     1 n010
  223      defq cromwell smrtanal  R 4:55     1 n011
  ...
  230      defq cromwell smrtanal  R 4:55     1 n011
  231      defq cromwell smrtanal  R 4:55     1 n012
  ...
  238      defq cromwell smrtanal  R 4:55     1 n012
  239      defq cromwell smrtanal  R 4:55     1 n013
  ...
  246      defq cromwell smrtanal  R 4:55     1 n013
Now I need to get the pending (Priority) and (Resources) jobs running on n[011-013]...
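Comparing what Slurm thinks each node has might show where the difference comes from; roughly (the slurm.conf path is a guess, adjust for your install):

    # compare total/allocated CPUs as Slurm sees them on the old vs new nodes
    scontrol show node n010 | grep -E 'CPUAlloc|CPUTot|ThreadsPerCore'
    scontrol show node n011 | grep -E 'CPUAlloc|CPUTot|ThreadsPerCore'
    # and double-check that the NodeName definitions match the real hardware
    grep -i '^NodeName' /etc/slurm/slurm.conf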