[slurm-users] Slurm queue seems to be completely blocked
Joakim Hove
joakim.hove at gmail.com
Mon May 11 17:31:46 UTC 2020
ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*    up     infinite       1  drain  ip-172-31-80-232
● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-05-11 17:30:43 UTC; 25s ago
     Docs: man:slurmd(8)
  Process: 2547 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 2567 (slurmd)
    Tasks: 1 (limit: 1121)
   CGroup: /system.slice/slurmd.service
           └─2567 /usr/sbin/slurmd

May 11 17:30:43 ip-172-31-80-232 systemd[1]: Starting Slurm node daemon...
May 11 17:30:43 ip-172-31-80-232 systemd[1]: slurmd.service: Can't open PID file /var/run/slurm-llnl/slurmd.pid (yet?) after start: No such file or directory
May 11 17:30:43 ip-172-31-80-232 systemd[1]: Started Slurm node daemon.
This looks reasonable to me?
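One thing that does stand out in the sinfo output above is the "drain" state: a drained node accepts no new allocations, which would explain why jobs sit pending with "(Resources)" on a one-node cluster even though nothing is running. A minimal check/recovery sketch - the node name is taken from the sinfo output above, and State=RESUME needs Slurm admin rights:

    sinfo -R                                                   # list down/drained nodes with the recorded reason
    scontrol show node ip-172-31-80-232                        # the Reason= field says why the node was drained
    scontrol update NodeName=ip-172-31-80-232 State=RESUME     # return the node to service once it is healthy

After the resume, the pending jobs should start being scheduled again.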
On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <alex at calicolabs.com> wrote:
> You will want to look at the output of 'sinfo' and 'scontrol show node' to
> see what slurmctld thinks about your compute nodes; then on the compute
> nodes you will want to check the status of the slurmd service ('systemctl
> status -l slurmd') and possibly read through the slurmd logs as well.
>
> On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.hove at gmail.com>
> wrote:
>
>> Hello;
>>
>> I am in the process of familiarizing myself with slurm - I will write a
>> piece of software which will submit jobs to a slurm cluster. Right now I
>> have just made my own "cluster" consisting of one Amazon AWS node and use
>> that to familiarize myself with the sxxx commands - that has worked nicely.
>>
>> Now I just brought this AWS node completely to its knees (not slurm
>> related) and had to stop and start the node from the AWS console - during
>> that process a job managed by slurm was killed hard. Now that the node is
>> back up again slurm refuses to start jobs - the queue looks like this:
>>
>> ubuntu@ip-172-31-80-232:~$ squeue
>>   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>>     186     debug tmp-file www-data PD  0:00     1 (Resources)
>>     187     debug tmp-file www-data PD  0:00     1 (Resources)
>>     188     debug tmp-file www-data PD  0:00     1 (Resources)
>>     189     debug tmp-file www-data PD  0:00     1 (Resources)
>>
>> I.e. the jobs are pending for "Resources" reasons, but no jobs are
>> running? I have tried to scancel all the jobs, but when I add new jobs
>> they again just stay pending. It should be said that when the node/slurm
>> came back up again, the offending job which initially created the havoc
>> was still in "Running" state, even though the filesystem of that job had
>> been completely wiped, so it was not in a sane state. scancel of this job
>> worked fine - but no new jobs will start. It seems as if a "ghost job" is
>> blocking the other jobs from starting? I even tried to reinstall slurm
>> using the package manager, but the new slurm installation would still not
>> start jobs. Any tips on how I can proceed to debug this?
>>
>> Regards
>>
>> Joakim
>>
>
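Following up on the suggestion above to check the slurmd service and read its logs, a minimal sketch - assuming a systemd-managed slurmd and the Debian/Ubuntu slurm-llnl packaging; the actual log locations are whatever SlurmdLogFile and SlurmctldLogFile are set to in slurm.conf:

    journalctl -u slurmd --since "1 hour ago"        # slurmd messages captured by systemd
    tail -n 100 /var/log/slurm-llnl/slurmd.log       # node daemon log (assumed default path)
    tail -n 100 /var/log/slurm-llnl/slurmctld.log    # controller log (assumed default path)

The slurmctld log in particular usually records why a node was put into the drain state after an unclean shutdown, and why pending jobs are not being scheduled.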