[slurm-users] only 1 job running
Chandler
admin at genome.arizona.edu
Thu Jan 28 20:42:01 UTC 2021
Christopher Samuel wrote on 1/28/21 12:50:
> Did you restart the slurm daemons when you added the new node? Some internal data structures (bitmaps) are built based on the number of nodes and they need to be rebuilt with a restart in this situation.
>
> https://slurm.schedmd.com/faq.html#add_nodes
OK, this seems to have helped a little. After restarting the Slurm services, and killing and restarting the task as well, more jobs are running.
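For reference, the restarts amounted to something like the following (assuming systemd-managed Slurm services here; unit names and how you reach the compute nodes may differ on other installs):

    # on the head node, restart the controller so it rebuilds its node bitmaps
    systemctl restart slurmctld
    # on every compute node, restart the node daemon (via ssh/pdsh or similar)
    systemctl restart slurmd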
Still, two issues remain:
1. n[011-013] were in a drain state and had to be updated back to "idle" (see the commands below).
2. n[011-013] are now running jobs too, but only 8 each when they should be running 16; n010 is running 16 jobs now as expected.
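For the record, clearing the drain state was done with something along these lines (exact syntax typed from memory, so it may need adjusting):

    # check why the nodes were drained before clearing them
    sinfo -R
    # return the drained nodes to service (they go back to "idle" once nothing is running on them)
    scontrol update NodeName=n[011-013] State=RESUME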
The squeue output looks like:
JOBID PARTITION     NAME     USER ST TIME NODES NODELIST(REASON)
  247      defq cromwell smrtanal PD 0:00     1 (Resources)
  248      defq cromwell smrtanal PD 0:00     1 (Priority)
  ...
  278      defq cromwell smrtanal PD 0:00     1 (Priority)
  207      defq cromwell smrtanal  R 5:01     1 n010
  ...
  222      defq cromwell smrtanal  R 5:01     1 n010
  223      defq cromwell smrtanal  R 4:55     1 n011
  ...
  230      defq cromwell smrtanal  R 4:55     1 n011
  231      defq cromwell smrtanal  R 4:55     1 n012
  ...
  238      defq cromwell smrtanal  R 4:55     1 n012
  239      defq cromwell smrtanal  R 4:55     1 n013
  ...
  246      defq cromwell smrtanal  R 4:55     1 n013
Now I need to get the pending (Priority) and (Resources) jobs running on n[011-013]...
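Comparing what Slurm thinks each node has might show where the difference comes from; roughly (the slurm.conf path is a guess, adjust for your install):

    # compare total/allocated CPUs as Slurm sees them on the old vs new nodes
    scontrol show node n010 | grep -E 'CPUAlloc|CPUTot|ThreadsPerCore'
    scontrol show node n011 | grep -E 'CPUAlloc|CPUTot|ThreadsPerCore'
    # and double-check that the NodeName definitions match the real hardware
    grep -i '^NodeName' /etc/slurm/slurm.conf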