[slurm-users] Slurm queue seems to be completely blocked

Mon May 11 17:23:35 UTC 2020

ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=ip-172-31-80-232 NodeHostName=ip-172-31-80-232 Version=17.11
   OS=Linux 5.3.0-1017-aws #18~18.04.1-Ubuntu SMP Wed Apr 8 15:12:16 UTC 2020
   RealMemory=983 AllocMem=0 FreeMem=355 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-05-11T17:02:15 SlurmdStartTime=2020-05-11T17:02:27
   CfgTRES=cpu=1,mem=983M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2020-05-11T16:20:02]

The "State=IDLE+DRAIN" looks a bit suspicious?
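
As far as I understand it, a node flagged DRAIN accepts no new jobs, which
would explain why everything stays pending with (Resources) although nothing
is running. "Low RealMemory" is the reason Slurm records when a node
registers with less memory than the RealMemory value configured for it in
slurm.conf. Below is a minimal sketch for spotting this from the
'scontrol show node' output. The helper names parse_node and is_drained are
my own, and the sample text is trimmed from the output above. It only reads
text; it does not talk to slurmctld.

```python
# Sample trimmed from the `scontrol show node` output in this message.
SAMPLE = """\
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
   RealMemory=983 AllocMem=0 FreeMem=355 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   Reason=Low RealMemory [root@2020-05-11T16:20:02]
"""


def parse_node(text):
    """Parse one node record of `scontrol show node` into a dict.

    Hypothetical helper: free-text values other than Reason (for
    example OS=...) are cut at the first space.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Reason="):
            # Reason is free text with spaces, so keep the rest of the line.
            fields["Reason"] = line[len("Reason="):]
            continue
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
    return fields


def is_drained(fields):
    """Return True if the node state carries the DRAIN flag."""
    return "DRAIN" in fields.get("State", "")


if __name__ == "__main__":
    node = parse_node(SAMPLE)
    if is_drained(node):
        print(node["NodeName"], "is drained:", node.get("Reason", "no reason given"))
```

Once the underlying cause is dealt with (here presumably the RealMemory
setting in slurm.conf versus what the node actually reports), the usual way
to clear the flag is 'scontrol update NodeName=ip-172-31-80-232 State=RESUME'.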

On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <alex at calicolabs.com> wrote:

> You will want to look at the output of 'sinfo' and 'scontrol show node' to
> see what slurmctld thinks about your compute nodes; then on the compute
> nodes you will want to check the status of the slurmd service ('systemctl
> status -l slurmd') and possibly read through the slurmd logs as well.
>
> On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.hove at gmail.com>
> wrote:
>
>> Hello;
>>
>> I am in the process of familiarizing myself with slurm - I will write a
>> piece of software which will submit jobs to a slurm cluster. Right now I
>> have just made my own "cluster" consisting of one Amazon AWS node and use
>> that to familiarize myself with the sxxx commands - this has worked nicely.
>>
>> Now I just brought this AWS node completely to its knees (not slurm
>> related) and had to stop and start the node from the AWS console - during
>> that process a job managed by slurm was killed hard. Now when the node is
>> back up again slurm refuses to start up jobs - the queue looks like this:
>>
>> ubuntu@ip-172-31-80-232:~$ squeue
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>                186     debug tmp-file www-data PD       0:00      1 (Resources)
>>                187     debug tmp-file www-data PD       0:00      1 (Resources)
>>                188     debug tmp-file www-data PD       0:00      1 (Resources)
>>                189     debug tmp-file www-data PD       0:00      1 (Resources)
>>
>> I.e. the jobs are pending for "Resources" reasons, but no jobs are
>> running? I have tried to scancel all jobs, but when I add new jobs they
>> again just stay pending. It should be said that when the node/slurm came
>> back up again, the offending job which initially created the havoc was
>> still in "Running" state, but the filesystem of that job had been
>> completely wiped, so it was not in a sane state. scancel of this job worked
>> fine - but no new jobs will start. Seems like there is a "ghost job"
>> blocking the other jobs from starting? I even tried to reinstall slurm
>> using the package manager, but the new slurm installation would still not
>> start jobs. Any tips on how I can proceed to debug this?
>>
>> Regards
>>
>> Joakim
>>
>