[slurm-users] Jobs waiting while plenty of cpu and memory available
Edward Ned Harvey (slurm)
slurm at nedharvey.com
Tue Jul 9 00:37:09 UTC 2019
I have a cluster where I submit a bunch of jobs (600 of them), but the cluster only runs about 20 at a time. Using pestat, I can see there are a number of nodes with plenty of available CPU and memory:
Hostname   Partition   State   CPU Use/Tot   CPUload   Memsize (MB)   Freemem (MB)
pcomp13    batch*      idle        0/72        8.19*         258207         202456
pcomp14    batch*      idle        0/72        0.00          258207         206558
pcomp16    batch*      idle        0/72        0.05          258207         230609
pcomp17    batch*      idle        0/72        8.51*         258207         184492
pcomp18    batch       mix        14/72        0.29*         258207         230575
pcomp19    batch*      idle        0/72       10.11*         258207         179604
pcomp20    batch*      idle        0/72        9.56*         258207         180961
pcomp21    batch*      idle        0/72        0.10          258207         227255
pcomp25    batch*      idle        0/72        0.07          258207         218035
pcomp26    batch*      idle        0/72        0.03          258207         226489
pcomp27    batch*      idle        0/72        0.25          258207         228580
pcomp28    batch*      idle        0/72        8.15*         258207         184306
pcomp29    batch       mix         2/72        0.01*         258207         226256
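For what it's worth, the "only runs about 20 at a time" number comes from counting running versus pending jobs with something like the following (-h just suppresses the squeue header):

# My running jobs
squeue -u $USER -t RUNNING -h | wc -l

# My pending jobs
squeue -u $USER -t PENDING -h | wc -l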
How can I tell why the jobs aren't running? "scontrol show job 123456" shows "JobState=PENDING Reason=Priority", which doesn't shed any light on the situation for me. The pending jobs each request 1 CPU and 2 GB of memory.
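I assume something like the following would tell me more about job priorities and what the scheduler is doing, but I'm not sure what to look for (123456 is just a placeholder job id):

# Priority factors the scheduler has computed for the job
sprio -j 123456 -l

# All pending jobs, with the reason column included
squeue -t PENDING -o "%.18i %.9P %.8u %.2t %.10r %.6D %.5C"

# Partition limits that might cap how many jobs run at once
scontrol show partition batch

# Scheduler / backfill statistics
sdiag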
Should I just restart the Slurm daemons? Or is there some way for me to see why these nodes aren't picking up jobs?
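(By "restart the Slurm daemons" I mean roughly the following, assuming systemd-managed Slurm; I'd rather understand the problem than restart blindly.)

# On the controller node
sudo systemctl restart slurmctld

# On each compute node
sudo systemctl restart slurmd

# Or just have the daemons re-read slurm.conf without a full restart
scontrol reconfigure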