[slurm-users] only 1 job running

Thu Jan 28 05:28:08 UTC 2021

Hi list, we have a new cluster set up with Bright Cluster Manager.  We're looking into a support contract there, but trying to get community support in the meantime.  I'm sure things were working when the cluster was delivered, but I provisioned an additional node and now the scheduler isn't quite working right.

The new node I provisioned had a slightly different disk layout, so I had to provision it a bit differently from the other nodes.  I made some changes to the slurm queue as well, within Bright Cluster Manager, to account for the additional resources, but I must've messed something up.  Now, only 1 job is running on the new node, when there should be 16 running.  Further, there are no jobs running on the original nodes.

When I run squeue, this is the output:

              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                156      defq cromwell smrtuser PD       0:00      1 (Resources)
                157      defq cromwell smrtuser PD       0:00      1 (Priority)
                ...
                203      defq cromwell smrtuser PD       0:00      1 (Priority)
                155      defq cromwell smrtuser  R      39:35      1 n010

Job 155 is running on the "new" node (n010) and, as you can see, the other jobs are pending and not running anywhere.  Once job 155 finishes the next one will start, so the queue is working, although slowly, one job at a time.
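
In case it helps to see the listing summarized, here is a minimal Python sketch that tallies the squeue rows above by state and reason.  It works on a pasted copy of the text rather than calling Slurm, and the function name `summarize_squeue` is just something I made up for illustration, not part of any Slurm tool.  It only counts the rows shown, so the jobs hidden behind the "..." line are not included, and it assumes plain numeric job IDs and job names without spaces.

```python
from collections import Counter

# Pasted copy of the rows shown above (the elided "..." rows are left out).
SQUEUE_TEXT = """\
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  156      defq cromwell smrtuser PD       0:00      1 (Resources)
  157      defq cromwell smrtuser PD       0:00      1 (Priority)
  203      defq cromwell smrtuser PD       0:00      1 (Priority)
  155      defq cromwell smrtuser  R      39:35      1 n010
"""


def summarize_squeue(text):
    """Count jobs by (state, nodelist-or-reason) from default squeue output."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header, and anything that is not a job row.
        # Assumption: job IDs are plain integers (no array suffixes).
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        state = fields[4]              # ST column, e.g. PD or R
        where = " ".join(fields[7:])   # NODELIST(REASON) column
        counts[(state, where)] += 1
    return counts


if __name__ == "__main__":
    for (state, where), n in sorted(summarize_squeue(SQUEUE_TEXT).items()):
        print(f"{state:>2} {where:<12} {n}")
```

The point of the tally is just to show that every pending job is waiting on either Resources or Priority while a single job runs on n010.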

I'm new to all this so not sure where to start troubleshooting.  I'd like to get these other jobs started so our task can be completed in a timely manner, and figure out why only 1 job is running when they all should be running.

Thanks
-- 
Chandler Sobel-Sorenson / Systems Administrator
Arizona Genomics Institute
University of Arizona