[slurm-users] [External] Re: Slurm queue seems to be completely blocked

Michael Robbert mrobbert at mines.edu
Mon May 11 17:53:50 UTC 2020

You’re on the right track with the DRAIN state. The more specific answer is in the “Reason=” description on the last line. 

It looks like your node has less memory than what you’ve defined for the node in slurm.conf
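If that is the case, the usual fix is to compare what slurmd actually detects against the RealMemory= you declared, lower the configured value, and then clear the DRAIN state. A sketch (assuming root access and that ip-172-31-80-232 is the NodeName in your slurm.conf; the RealMemory value of 900 below is illustrative):

```shell
# Ask slurmd what hardware it detects; compare its RealMemory with
# the RealMemory= configured for this node (983 in your scontrol output).
slurmd -C

# Edit slurm.conf so the node's RealMemory is no higher than what
# slurmd reports, e.g.:
#   NodeName=ip-172-31-80-232 CPUs=1 RealMemory=900 State=UNKNOWN

# Have the daemons re-read slurm.conf, then clear the DRAIN state.
scontrol reconfigure
scontrol update NodeName=ip-172-31-80-232 State=RESUME
```

Until the drain reason is cleared, jobs will keep sitting in PD with "(Resources)" even though the node is idle.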




From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Joakim Hove <joakim.hove at gmail.com>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Monday, May 11, 2020 at 11:25
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [External] Re: [slurm-users] Slurm queue seems to be completely blocked





ubuntu at ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
   NodeAddr=ip-172-31-80-232 NodeHostName=ip-172-31-80-232 Version=17.11
   OS=Linux 5.3.0-1017-aws #18~18.04.1-Ubuntu SMP Wed Apr 8 15:12:16 UTC 2020 
   RealMemory=983 AllocMem=0 FreeMem=355 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2020-05-11T17:02:15 SlurmdStartTime=2020-05-11T17:02:27
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root at 2020-05-11T16:20:02]


The "State=IDLE+DRAIN" looks a bit suspicious?





On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <alex at calicolabs.com> wrote:

You will want to look at the output of 'sinfo' and 'scontrol show node' to see what slurmctld thinks about your compute nodes; then on the compute nodes you will want to check the status of the slurmd service ('systemctl status -l slurmd') and possibly read through the slurmd logs as well.
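Put together, a quick triage pass might look like this (a sketch; on a single-node setup like yours, the controller and compute node are the same machine, and journalctl is assumed to hold the slurmd log unless SlurmdLogFile points elsewhere in slurm.conf):

```shell
# On the controller: what does slurmctld think of the nodes?
sinfo
scontrol show node

# On the compute node: is slurmd healthy, and what has it logged?
systemctl status -l slurmd
journalctl -u slurmd --since "1 hour ago"
```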


On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.hove at gmail.com> wrote:



I am in the process of familiarizing myself with slurm - I will be writing a piece of software that submits jobs to a slurm cluster. Right now I have made my own "cluster" consisting of a single Amazon AWS node and am using it to familiarize myself with the sxxx commands - that has worked nicely.


Now I just brought this AWS node completely to its knees (not slurm related) and had to stop and start the node from the AWS console - during that process, a job managed by slurm was killed hard. Now that the node is back up, slurm refuses to start jobs - the queue looks like this:


ubuntu at ip-172-31-80-232:~$ squeue
               186     debug tmp-file www-data PD       0:00      1 (Resources)
               187     debug tmp-file www-data PD       0:00      1 (Resources)
               188     debug tmp-file www-data PD       0:00      1 (Resources)
               189     debug tmp-file www-data PD       0:00      1 (Resources)


I.e. the jobs are pending for "Resources" reasons, but no jobs are running. I have tried scancel on all the jobs, but when I add new jobs they again just stay pending.

It should be said that when the node/slurm came back up, the offending job that initially created the havoc was still in "Running" state, but the filesystem of that job had been completely wiped, so it was not in a sane state. scancel of that job worked fine - but no new jobs will start. It seems like a "ghost job" is blocking the other jobs from starting? I even tried reinstalling slurm with the package manager, but the new installation would still not start jobs. Any tips on how I can proceed to debug this?





