[slurm-users] Slurm queue seems to be completely blocked

Mon May 11 17:09:17 UTC 2020

Hello;

I am in the process of familiarizing myself with Slurm - I will be writing
a piece of software which submits jobs to a Slurm cluster. Right now I have
just made my own "cluster" consisting of one Amazon AWS node, and I use that
to familiarize myself with the sxxx commands - this has worked nicely.

Now I just brought this AWS node completely to its knees (not Slurm
related) and had to stop and start the node from the AWS console - during
that process a job managed by Slurm was killed hard. Now that the node is
back up again, Slurm refuses to start jobs - the queue looks like this:

ubuntu@ip-172-31-80-232:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               186     debug tmp-file www-data PD       0:00      1 (Resources)
               187     debug tmp-file www-data PD       0:00      1 (Resources)
               188     debug tmp-file www-data PD       0:00      1 (Resources)
               189     debug tmp-file www-data PD       0:00      1 (Resources)

I.e. the jobs are pending with reason (Resources), yet no jobs are running.
I have tried to scancel all jobs, but when I add new jobs they again just
stay pending. It should be said that when the node and Slurm came back up,
the offending job which initially created the havoc was still in "Running"
state, but the filesystem of that job had been completely wiped, so it was
not in a sane state. scancel of this job worked fine - but no new jobs will
start. It seems like there is a "ghost job" blocking the other jobs from
starting? I even tried to reinstall Slurm using the package manager, but
the new Slurm installation would still not start jobs. Any tips on how I
can proceed to debug this?
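
For reference, here is a minimal sketch of a first debugging pass. It rests
on an assumption that nothing above confirms: that the hard stop/start left
the node marked DOWN or DRAINED in Slurm, which would keep every job pending
on (Resources) even though nothing is running. The function names
node_looks_unavailable and diagnose_node are hypothetical helpers, and the
node name in the usage comment is only guessed from the shell prompt; the
real name is whatever NodeName is set to in slurm.conf.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a node state string (as printed by
# `sinfo -o "%T"`) suggests the node is down or drained.
node_looks_unavailable() {
    case "$1" in
        *down*|*drain*|*DOWN*|*DRAIN*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print what Slurm currently thinks about one node.
# Nothing is changed; the resume command is only suggested.
diagnose_node() {
    node="$1"

    # Reasons recorded for nodes that are down, drained or failing.
    sinfo -R

    # Full node record, including the State= and Reason= fields.
    scontrol show node "$node"

    # Extended state of this node only, without the header line.
    state=$(sinfo -h -N -n "$node" -o "%T" | head -n 1)

    if node_looks_unavailable "$state"; then
        echo "Node $node is in state '$state'."
        echo "If the node is actually healthy, an administrator could try:"
        echo "  scontrol update NodeName=$node State=RESUME"
    else
        echo "Node $node is in state '$state'; the node state is probably not the cause."
    fi
}

# Usage (node name is an assumption taken from the prompt above):
#   diagnose_node ip-172-31-80-232
```

The sketch only reads state and prints a suggestion, since returning a node
to service needs administrator rights and should follow a check that the
node really is usable again.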

Regards

Joakim