[slurm-users] how to find out why a job won't run?

R. Paul Wiegand rpwiegand at gmail.com
Mon Nov 26 06:24:46 MST 2018


Steve,

This doesn't really address your question, and I am guessing you are
already aware of it; but since you did not mention it: "scontrol show
job <jobid>" will give you a lot more detail about a job than squeue
does.  Its "Reason" field is the same one that sinfo and squeue report,
though, so no help there.  I've always found that it is a bit of a
detective exercise.  In the end there is always a reason; it is just
sometimes very subtle.  For example, we use "Features" so that users
can constrain their jobs based on various factors (e.g., CPU
architecture), and we'll sometimes have users ask for something like a
"Haswell" processor and 190 GB of memory ... but we only have that
much memory on our Skylake machines.  So the real reason can be quite
indirect.
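
For what it's worth, a rough way to do that cross-check by hand is
below.  The job ID and output format are only illustrative, so adjust
them to whatever the stuck job actually requested:

    # Full detail for the pending job, including its requested
    # features, memory, CPUs, and the scheduler's current Reason.
    scontrol show job 12345

    # What each node actually offers: name, features, CPUs, memory
    # (MB), free memory (MB), and state.  Reading this next to the
    # job's requests usually exposes the mismatch (e.g. no Haswell
    # node with 190 GB of RAM).
    sinfo -N -h -o '%N %f %c %m %e %t'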

Sadly, I don't know of an easy tool that just looks at all the data
and tells you or gives you better clues.  I agree that that would be
very helpful.


As to preemptable jobs: do you have checkpointing enabled via SLURM?
There are situations in which a SLURM-checkpointed job will still
occupy some memory, so a pending job cannot start because that memory
is in use even though the other job was suspended.  Perhaps someone on
the list with more experience using preemptable partitions/QoS *WITH*
the SLURM checkpointing flag enabled could speak to this?  As Steve
knows, we just cancel the job when it is preempted.
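
If you want to check whether preemption or checkpointing is in play
on your cluster, something along these lines should show the relevant
settings and any suspended jobs still parked on nodes (off the top of
my head, so adjust to taste):

    # Cluster-wide preemption and checkpoint settings, if any.
    scontrol show config | grep -iE 'preempt|checkpoint'

    # Per-QOS preemption rules.
    sacctmgr show qos format=Name,Priority,Preempt,PreemptMode

    # Suspended jobs -- these still hold their memory allocation on
    # the node, which is exactly the situation described above.
    squeue -t S -o '%.10i %.9u %.9P %.8m %N'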

Paul.

On Mon, Nov 26, 2018 at 3:22 AM Daan van Rossum <d.r.vanrossum at gmx.de> wrote:
>
> I'm also interested in this.  Another example: "Reason=(ReqNodeNotAvail)" is all that a user sees in a situation when his/her job's walltime runs into a system maintenance reservation.
>
> * on Friday, 2018-11-23 09:55 -0500, Steven Dick <kg4ydw at gmail.com> wrote:
>
> > I'm looking for a tool that will tell me why a specific job in the
> > queue is still waiting to run.  squeue doesn't give enough detail.  If
> > the job is held up on QOS, it's pretty obvious.  But if it's
> > resources, it's difficult to tell.
> >
> > If a job is not running because of resources, how can I identify which
> > resource is not available?  In a few cases, I've looked at what the
> > job asked for and found a node that has those resources free, but
> > still can't figure out why it isn't running.
> >
> > Also, if there are preemptable jobs in the queue, why is the job
> > waiting on resources?  Is there a priority for running jobs that can
> > be compared to waiting jobs?


