[slurm-users] Slurm queue seems to be completely blocked

Mon May 11 17:14:41 UTC 2020

You will want to look at the output of 'sinfo' and 'scontrol show node' to
see what slurmctld thinks about your compute nodes; then on the compute
nodes you will want to check the status of the slurmd service ('systemctl
status -l slurmd') and possibly read through the slurmd logs as well.
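
For reference, here is a rough sketch of those checks collected into one shell function. The function name slurm_node_report is made up for illustration, and the journalctl call assumes slurmd logs to the systemd journal; if it does not, read the file named by SlurmdLogFile in slurm.conf instead.

```shell
#!/bin/sh
# Sketch only: slurm_node_report is a hypothetical helper, not part of Slurm.
slurm_node_report() {
    echo "== controller view (what slurmctld thinks of the nodes) =="
    if command -v sinfo >/dev/null 2>&1; then
        # Partition/node state summary.
        sinfo || echo "sinfo failed; is slurmctld running?"
        # Reasons recorded for nodes that are down or drained.
        sinfo -R || echo "sinfo -R failed"
        # Per-node detail, including the State= and Reason= fields.
        scontrol show node || echo "scontrol show node failed"
    else
        echo "sinfo not found in PATH; skipping controller checks"
    fi

    echo "== compute node view (run this part on the compute node) =="
    if command -v systemctl >/dev/null 2>&1; then
        systemctl status -l --no-pager slurmd || echo "slurmd is not reported as active"
        # Assumption: slurmd logs to the journal; otherwise see SlurmdLogFile.
        journalctl -u slurmd -n 50 --no-pager || echo "could not read the slurmd journal"
    else
        echo "systemctl not found; check slurmd by other means"
    fi
    return 0
}

slurm_node_report
```

On a single-node setup like yours both halves run on the same machine. If the node turns out to be marked DOWN (for example because of the unexpected reboot), 'scontrol update NodeName=<name> State=RESUME' is the usual way to return it to service, but whether that applies here is a guess until you have seen the output.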

On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.hove at gmail.com> wrote:

> Hello;
>
> I am in the process of familiarizing myself with slurm - I will write a
> piece of software which will submit jobs to a slurm cluster. Right now I
> have just made my own "cluster" consisting of one Amazon AWS node and use
> that to familiarize myself with the sxxx commands - has worked nicely.
>
> Now I just brought this AWS node completely to its knees (not slurm
> related) and had to stop and start the node from the AWS console - during
> that process a job managed by slurm was killed hard. Now when the node is
> back up again slurm refuses to start up jobs - the queue looks like this:
>
> ubuntu@ip-172-31-80-232:~$ squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                186     debug tmp-file www-data PD       0:00      1 (Resources)
>                187     debug tmp-file www-data PD       0:00      1 (Resources)
>                188     debug tmp-file www-data PD       0:00      1 (Resources)
>                189     debug tmp-file www-data PD       0:00      1 (Resources)
>
> I.e. the jobs are pending due to Resource reasons, but no jobs are
> running? I have tried to scancel all jobs, but when I add new jobs they again
> just stay pending. It should be said that when the node/slurm came back up
> again the offending job which initially created the havoc was still in
> "Running" state, but the filesystem of that job had been completely wiped
> so it was not in a sane state. scancel of this job worked fine - but no new
> jobs will start. Seems like there is a "ghost job" blocking the other jobs
> from starting? I even tried to reinstall slurm using the package manager,
> but the new slurm installation would still not start jobs. Any tips on how
> I can proceed to debug this?
>
> Regards
>
> Joakim
>