[slurm-users] status of cloud nodes
nathan norton
nathan at nanoservices.com.au
Wed Jun 19 04:36:56 UTC 2019
Hi,
Just tried running that command, but it only shows nodes that are up and
running; it doesn't tell me anything about nodes that are down and powered
off. As an example, please see below. There is a job running that should be
using the 100 nodes, but only 52 are allocated (plus 2 down*, which I know
about and don't care about in this case). Where are the stats and details on
why the roughly 40 other nodes are not being used? (There is nothing in the
master's log file either.)
btuser at bt_slurm_login001 ~ % tail /etc/slurm/slurm.conf
NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 State=CLOUD
NodeName=bt_slurm_login00[1-10] State=DOWN # these are the login nodes
PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP Priority=1 PreemptMode=requeue
btuser at bt_slurm_login001 ~ % sinfo -p backtest
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
backtest* up 5:00:00 2 down* ip-10-0-8-[29-30]
backtest* up 5:00:00 52 alloc ip-10-0-8-[4-17,19-24,26-28,31-59]
btuser at bt_slurm_login001 ~ %
btuser at bt_slurm_login001 ~ % sinfo -p backtest -Rl -O reason:35,user,timestamp,statelong,nodelist
Wed Jun 19 01:24:59 2019
REASON                             USER                TIMESTAMP           STATE               NODELIST
Not responding                     root                2019-06-04T04:09:31 down*               ip-10-0-8-[29-30]
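
(As a side note: a rough way to see the scheduler's stated reason for the
shortfall, with JOBID standing in for the real job ID, would be something
like

scontrol show job JOBID | grep -E 'JobState|Reason|NumNodes|NodeList'
squeue -j JOBID -o '%.10i %.9P %.8T %.6D %R'

where squeue's %R prints the reason a pending job is waiting, or its node
list once it is running.)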
On Tue., 18 Jun. 2019, 9:32 pm Sam Gallop (NBI), <sam.gallop at nbi.ac.uk>
wrote:
> Hi Nathan,
>
> The command I use to get the reason for failed nodes is 'sinfo -Ral'. If
> you need to extend the width of the output, then use
> 'sinfo -Ral -O reason:35,user,timestamp,statelong,nodelist'.
>
> Using the timestamp of the failure, look in the slurmd or slurmctld logs.
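> For example, something along these lines (the log paths here are an
> assumption; the real locations come from SlurmctldLogFile and SlurmdLogFile
> in slurm.conf, and the timestamp format depends on LogTimeFormat):
>
> grep '2019-06-04T04:09' /var/log/slurm/slurmctld.log
> grep '2019-06-04T04:09' /var/log/slurm/slurmd.log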
>
> ---
> Sam Gallop
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> nathan norton
> Sent: 18 June 2019 09:33
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] status of cloud nodes
>
> Hi all,
>
> I am using Slurm with a cloud provider and it is all working a treat.
>
> Let's say I have 100 nodes, all up and able to be scheduled. A job such as
>
> $ srun -N100 hostname
>
> works fine.
>
> For some unknown reason, after the machines have shut down (for example
> over the weekend, when no jobs get scheduled for an hour), the next job,
>
> $ srun -N90 hostname
>
> fails with:
>
> "srun: Required node not available (down, drained or reserved)"
>
> "srun: job JOBID queued and waiting for resources"
>
> This is weird, as no other jobs are running and I should be able to start
> up the nodes as requested.
>
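> One way to check node state at that point, listing each node's state and
> any recorded reason (these are standard sinfo format options):
>
> $ sinfo -N -o '%N %t %E'
>
> where %t is the compact node state and %E is the reason a node is
> unavailable.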
>
> These being 'cloud' type nodes, if I run
>
> $ scontrol show node
>
> only the up and working nodes are displayed, not the failed ones.
> How do I get information about the failed nodes?
>
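> (One possibility, depending on the Slurm release: powered-down CLOUD nodes
> are hidden from sinfo/scontrol listings by default, and the documented
> PrivateData option can make them visible. This is only a sketch, not
> something verified against this setup:
>
> PrivateData=cloud
>
> in slurm.conf, after which 'scontrol show node ip-10-0-8-[2-100]' should
> also list the powered-down nodes.)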
> If I stop all nodes and run the commands below, I can then start all the
> nodes up again:
>
> scontrol update NodeName=node-[1-100] State=DOWN Reason="undraining"
> scontrol update NodeName=node-[1-100] State=RESUME
> scontrol show node
>
>
> So that fixes it, but I want to figure out why nodes get into this state
> and how I can monitor it. Is there a command to get the status of CLOUD
> nodes?
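> (Two possible pointers rather than a definitive answer: sinfo flags
> power-saved nodes with a '~' suffix and nodes that are powering up with '#'
> in the STATE column, e.g.
>
> $ sinfo -N -o '%N %t'
>
> and raising SlurmctldDebug to 'debug' in slurm.conf should make slurmctld
> log ResumeProgram/SuspendProgram activity around the time the nodes went
> down.)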
>
> Any help appreciated.
>
> Thanks
>
> Nathan.