[slurm-users] Jobs waiting while plenty of cpu and memory available
Edward Ned Harvey (slurm)
slurm at nedharvey.com
Tue Jul 9 00:37:09 UTC 2019
I have a cluster where I submit a bunch of jobs (600 of them), but the cluster only runs about 20 at a time. Using pestat, I can see there are a number of nodes with plenty of available CPU and memory:
Hostname   Partition   State   CPU Use/Tot   CPUload   Memsize (MB)   Freemem (MB)
pcomp13    batch*      idle        0/72        8.19*         258207         202456
pcomp14    batch*      idle        0/72        0.00          258207         206558
pcomp16    batch*      idle        0/72        0.05          258207         230609
pcomp17    batch*      idle        0/72        8.51*         258207         184492
pcomp18    batch       mix        14/72        0.29*         258207         230575
pcomp19    batch*      idle        0/72       10.11*         258207         179604
pcomp20    batch*      idle        0/72        9.56*         258207         180961
pcomp21    batch*      idle        0/72        0.10          258207         227255
pcomp25    batch*      idle        0/72        0.07          258207         218035
pcomp26    batch*      idle        0/72        0.03          258207         226489
pcomp27    batch*      idle        0/72        0.25          258207         228580
pcomp28    batch*      idle        0/72        8.15*         258207         184306
pcomp29    batch       mix         2/72        0.01*         258207         226256
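For what it's worth, the "only runs about 20 at a time" number comes from counting running versus pending jobs with something like the following (-h just suppresses the squeue header):

# My running jobs
squeue -u $USER -t RUNNING -h | wc -l

# My pending jobs
squeue -u $USER -t PENDING -h | wc -l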
How can I tell why the jobs aren't running? "scontrol show job 123456" shows "JobState=PENDING Reason=Priority", which doesn't shed any light on the situation for me. The pending jobs each request 1 CPU and 2 GB of memory.
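I assume something like the following would tell me more about job priorities and what the scheduler is doing, but I'm not sure what to look for (123456 is just a placeholder job id):

# Priority factors the scheduler has computed for the job
sprio -j 123456 -l

# All pending jobs, with the reason column included
squeue -t PENDING -o "%.18i %.9P %.8u %.2t %.10r %.6D %.5C"

# Partition limits that might cap how many jobs run at once
scontrol show partition batch

# Scheduler / backfill statistics
sdiag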
Should I just restart the Slurm daemons? Or is there some way for me to see why these nodes aren't picking up jobs?
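(By "restart the Slurm daemons" I mean roughly the following, assuming systemd-managed Slurm; I'd rather understand the problem than restart blindly.)

# On the controller node
sudo systemctl restart slurmctld

# On each compute node
sudo systemctl restart slurmd

# Or just have the daemons re-read slurm.conf without a full restart
scontrol reconfigure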