[slurm-users] Troubleshooting scheduling

E.S. Rosenberg esr+slurm-dev at mail.hebrew.edu
Sun Mar 25 12:50:17 MDT 2018


Hi everyone,

Is there a guide anywhere on how to figure out why jobs aren't being
started?

We have a cluster with nodes of mixed sizes and speeds. Currently roughly half
the cluster is idle even though there are ~5k jobs queued.
Almost all of the jobs are pending with reason Priority, while only one job is
marked as waiting for Resources. That job needs slightly more RAM than Slurm
shows as available on any of the idle nodes (it asks for 62.5G, sinfo reports
62G, although 'free -m' on those nodes shows 62.8G).
So that job would seem to be waiting for the larger nodes, which is fine, but
what I don't understand is why the several thousand other jobs with very
modest memory requests (4-8G) aren't starting on the small nodes.
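
In case it helps, this is roughly how I've been comparing what Slurm thinks a
node has against what the OS reports (the node name below is just a
placeholder):

    # Configured memory (%m) vs. free memory (%e) per node, in MB
    sinfo -N -o "%N %m %e"

    # What Slurm has recorded for one node vs. what the OS sees
    scontrol show node somenode | grep -i RealMemory
    ssh somenode free -m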

We're using Slurm 17.11 with the sched/backfill scheduler.
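
(That's based on what 'scontrol show config' reports; roughly:)

    # Scheduler and node-selection settings currently in effect
    scontrol show config | egrep -i 'SchedulerType|SchedulerParameters|SelectType'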

My logic says that if that job is waiting for the larger nodes, then the
smaller nodes can easily be filled with small jobs without harming its
start time...

So how/where can I see why Slurm is not starting these jobs?
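
So far I've only looked at the reason column in squeue; I assume the next
steps would be something along these lines (job ID is a placeholder):

    # Reason codes and requested memory for pending jobs
    squeue --state=PD -o "%.12i %.9P %.8u %.10m %.20r"

    # Full detail for one pending job
    scontrol show job <jobid>

    # Scheduler / backfill statistics
    sdiag

    # Temporarily enable backfill logging in slurmctld.log
    scontrol setdebugflags +backfill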

One other thing: when I submit a tiny test job (just 'hostname'), Slurm
doesn't put it on the idle nodes but instead fits it onto nodes that are
already in use; only if I explicitly request an idle node does the test
job go there.
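
(By "explicitly request" I mean something like the following, where the node
name is made up:)

    # Lands on an already-busy node
    srun hostname

    # Only goes to the idle node when pinned to it
    srun -w idlenode01 hostname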

Thanks!
Eli


More information about the slurm-users mailing list