[slurm-users] status of cloud nodes
nathan norton
nathan at nanoservices.com.au
Tue Jun 18 08:33:05 UTC 2019
Hi all,
I am using Slurm with a cloud provider, and it is all working a treat.
Let's say I have 100 nodes, all working fine and able to be scheduled.
For example,
$ srun -N100 hostname
works fine.
For some unknown reason, after the machines have shut down (for example
over the weekend, once no jobs have been scheduled for an hour), the next
job that runs,
$ srun -N90 hostname
fails with:
"srun: Required node not available (down, drained or reserved)"
"srun: job JOBID queued and waiting for resources"
This is weird, as no other jobs are running and I should be able to start
up the nodes as requested.
Because these are 'cloud' type nodes, if I run
$ scontrol show node
only the up and working nodes are displayed, not the failed nodes.
How do I get the failed nodes' information?
If I stop all nodes and run the commands below, I can then start up all
nodes again:
scontrol update NodeName=node-1-100 State=DOWN Reason="undraining"
scontrol update NodeName=node-1-100 State=RESUME
scontrol: show node node
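
The two-step reset above (mark the nodes DOWN, then RESUME them) can be
wrapped in a small shell function. This is only a sketch: the name
reset_cloud_nodes is made up for illustration and is not part of Slurm,
and the node list is passed through to scontrol unchanged.

```shell
#!/bin/sh
# Sketch: repeat the manual workaround for a given node list.
# reset_cloud_nodes is a hypothetical helper name, not a Slurm command.
reset_cloud_nodes() {
    nodes="$1"    # node name or list, handed to scontrol as-is
    if [ -z "$nodes" ]; then
        echo "usage: reset_cloud_nodes NODELIST" >&2
        return 2
    fi
    # Step 1: force the nodes DOWN with a reason, as in the manual commands.
    scontrol update NodeName="$nodes" State=DOWN Reason="undraining" || return 1
    # Step 2: RESUME them so they can be scheduled again.
    scontrol update NodeName="$nodes" State=RESUME
}
```

With the node name used above, this would be called as
reset_cloud_nodes node-1-100.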
So that fixes it, but I want to figure out why nodes get into this state
and how I can monitor it. Is there a command to get the status of CLOUD
nodes?
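
As a starting point for monitoring, something like the sketch below could
list the nodes that Slurm reports as down, drained or failed, together
with the recorded reason. It relies on sinfo's node-oriented output (-N)
with the %N (node name), %T (state) and %E (reason) format fields. The
function name list_unavailable_nodes is made up, and this is an
assumption-laden sketch: powered-down CLOUD nodes may be hidden from sinfo
in the same way as from scontrol, depending on the Slurm version and
configuration, so it may not show every node in question.

```shell
#!/bin/sh
# Sketch: print "node state reason" for nodes whose state looks unhealthy.
# list_unavailable_nodes is a hypothetical helper name, not a Slurm command.
list_unavailable_nodes() {
    # -N: one line per node, -h: no header, -o: node name, state, reason
    sinfo -N -h -o '%N %T %E' |
        # keep lines whose state field mentions down, drain or fail
        awk '$2 ~ /down|drain|fail/' |
        # a node in several partitions is listed once per partition
        sort -u
}
```

Run from cron or by hand, an empty result would mean that no node visible
to sinfo is currently in one of those states.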
Any help appreciated.
Thanks
Nathan.