[slurm-users] Slurm queue seems to be completely blocked

Marcus Wagner wagner at itc.rwth-aachen.de
Wed May 13 05:21:46 UTC 2020

Hi Joakim,

one more thing to mention:

Am 11.05.2020 um 19:23 schrieb Joakim Hove:
> ubuntu at ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
> NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
>     Reason=Low RealMemory [root at 2020-05-11T16:20:02]
> The "State=IDLE+DRAIN" looks a bit suspicious?

I assume you find it suspicious that a node shows the states IDLE 
and DRAIN together, right?
But that is perfectly fine and fairly easy to explain in this case.
There are two different sets of state flags, and here you see one 
state from each set.

Instead of IDLE, the base state could also be ALLOCATED or MIXED.
Instead of DRAIN, the flag could also be e.g. DOWN or FAIL...

It gets clearer if you look at the sinfo output.

A node with ALLOCATED or MIXED together with DRAIN will be shown as DRAINING.
A node with IDLE (no running job, all cores free) together with DRAIN 
will be shown as DRAINED.
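The two-layer state model described above can be sketched in a few lines of Python. This is purely an illustration of the rule, not Slurm's actual implementation; the function name and signature are made up for this example:

```python
# Illustrative sketch: how sinfo's displayed state follows from a node's
# base state (IDLE/ALLOCATED/MIXED) plus its flags (e.g. DRAIN).
# Names here are hypothetical; the real logic lives inside sinfo.

def sinfo_display_state(base_state: str, flags: set) -> str:
    """Combine a base state with a set of flags into the state
    string sinfo would show for the node."""
    if "DRAIN" in flags:
        # Jobs are still running on the node: it is draining.
        if base_state in ("ALLOCATED", "MIXED"):
            return "DRAINING"
        # No running jobs, all cores free: the drain has completed.
        if base_state == "IDLE":
            return "DRAINED"
    return base_state

print(sinfo_display_state("IDLE", {"DRAIN"}))       # DRAINED
print(sinfo_display_state("ALLOCATED", {"DRAIN"}))  # DRAINING
print(sinfo_display_state("IDLE", set()))           # IDLE
```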


> On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <alex at calicolabs.com 
> <mailto:alex at calicolabs.com>> wrote:
>     You will want to look at the output of 'sinfo' and 'scontrol show
>     node' to see what slurmctld thinks about your compute nodes; then on
>     the compute nodes you will want to check the status of the slurmd
>     service ('systemctl status -l slurmd') and possibly read through the
>     slurmd logs as well.
>     On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.hove at gmail.com
>     <mailto:joakim.hove at gmail.com>> wrote:
>         Hello;
>         I am in the process of familiarizing myself with slurm - I will
>         write a piece of software which will submit jobs to a slurm
>         cluster. Right now I have just made my own "cluster" consisting
>         of one Amazon AWS node and use that to familiarize myself with
>         the sxxx commands - has worked nicely.
>         Now I just brought this AWS node completely to its knees (not
>         slurm related) and had to stop and start the node from the AWS
>         console - during that process a job managed by slurm was killed
>         hard. Now when the node is back up again slurm refuses to start
>         up jobs - the queue looks like this:
>         ubuntu at ip-172-31-80-232:~$ squeue
>                      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                        186     debug tmp-file www-data PD       0:00      1 (Resources)
>                        187     debug tmp-file www-data PD       0:00      1 (Resources)
>                        188     debug tmp-file www-data PD       0:00      1 (Resources)
>                        189     debug tmp-file www-data PD       0:00      1 (Resources)
>         I.e. the jobs are pending due to Resource reasons, but no jobs
>         are running? I have tried scancel all jobs, but when I add new
>         jobs they again just stay pending. It should be said that when
>         the node/slurm came back up again the offending job which
>         initially created the havoc was still in "Running" state, but
>         the filesystem of that job had been completely wiped so it was
>         not in a sane state. scancel of this job worked fine - but no
>         new jobs will start. Seems like there is "ghost job" blocking
>         the other jobs from starting? I even tried to reinstall slurm
>         using the package manager, but the new slurm installation would
>         still not start jobs. Any tips on how I can proceed to debug this?
>         Regards
>         Joakim
