[slurm-users] How to show state of CLOUD nodes

Kirill 'kkm' Katsnelson kkm at pobox.com
Fri Feb 28 11:56:06 UTC 2020


I'm running clusters entirely in Google Cloud. I'm not sure I understand
the issue--do the nodes disappear from view entirely only when they fail
to power up within ResumeTimeout? Failures of this kind happen in GCE when
resources are momentarily unavailable, but the nodes are still there, just
shown as DOWN. FWIW, I'm currently using 19.05.4-1.
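
If it helps to see what the controller currently thinks of a node, the
usual commands are (the node name below is just the one from your error
message, substitute your own):

  sinfo -R                          # DOWN/DRAINED nodes with their Reason
  scontrol show node c7-c5-24xl-464 | grep -Ei 'state|reason'

A cloud node that is powered down shows up in sinfo with a '~' suffix on
its state (e.g. 'idle~').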

I have a trigger on the controller that catches these nodes and returns
them to POWER_SAVE. The offset of 20s lets all the moving parts settle; in
any case, Slurm batches trigger runs internally, on a 15s schedule IIRC, so
it's not precise. --flags=PERM makes the trigger permanent, so you only
need to install it once:

strigger --set --down --flags=PERM --offset=20 --program=$script

and $script is the full path (on the controller) of the script below. I'm
copying the Log function and the _logname gymnastics from a file that is
dot-sourced by the main program in my setup, as it's part of a larger set
of scripts; it's more complex than it needs to be for your case, but I did
not want to introduce a bug by hastily paring it down. You can trim it
yourself if you want.

  ----8<--------8<--------8<----
#!/bin/bash

set -u

# Tidy up name for logging: '.../slurm_resume.sh' => 'slurm-resume'
_logname=$(basename "$0")
_logname=${_logname%%.*}
_logname=${_logname//_/-}

Log() {
  local level=$1; shift;
  [[ $level == *.* ]] || level=daemon.$level  # So we can use e.g. auth.notice.
  logger -p "$level" -t "$_logname" -- "$@"
}

reason=recovery

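# strigger runs this script with the name(s) of the offending node(s) as
# its argument(s), hence the loop over the positional parameters.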
for n; do
  Log notice "Recovering failed node(s) '$n'"
  scontrol update nodename="$n" reason="$reason" state=DRAIN &&
  scontrol update nodename="$n" reason="$reason" state=POWER_DOWN ||
    Log alert "The command 'scontrol update nodename=$n' failed." \
              "Is scontrol on PATH?"
done

exit 0
  ----8<--------8<--------8<----
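
For completeness, installing and checking it looks roughly like this; the
path is just an example, put the script wherever you keep such things on
the controller (it has to be executable by the user slurmctld runs as):

  chmod +x /usr/local/sbin/slurm-node-recover.sh
  strigger --set --down --flags=PERM --offset=20 \
           --program=/usr/local/sbin/slurm-node-recover.sh
  strigger --get     # confirm the trigger is registered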

The sequence of DRAIN first, then POWER_DOWN is a bit of magic left over
from v18; see if POWER_DOWN alone does the trick. Or don't, as long as it
works :)
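
If you want to test the simpler variant by hand, a single idle cloud node
will tell you quickly (node name again just an example):

  scontrol update nodename=c7-c5-24xl-464 reason=recovery state=POWER_DOWN
  sinfo -n c7-c5-24xl-464 -o '%N %T'   # watch it end up as idle~ once suspended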

Also make sure you have (some of) the following in slurm.conf, assuming EC2
provides DNS name resolution--GCE does.

# Important for cloud: do not assume the nodes will retain their IP
# addresses, and do not cache name-to-IP mapping.
CommunicationParameters=NoAddrCache
SlurmctldParameters=cloud_dns,idle_on_node_suspend
PrivateData=cloud   # Always show cloud nodes.
ReturnToService=2   # When a DOWN node boots, it becomes available.
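
After changing slurm.conf, 'scontrol reconfigure' (or a slurmctld restart;
some of these may only be read at startup) applies it, and you can verify
what the controller actually picked up with:

  scontrol show config | grep -E 'Communication|SlurmctldParam|PrivateData|ReturnToService'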

Hope this might help.

 -kkm

On Thu, Feb 27, 2020 at 4:11 PM Carter, Allan <cartalla at amazon.com> wrote:

> I’m setting up an EC2 SLURM cluster and when an instance doesn’t resume
> fast enough I get an error like:
>
>
>
> node c7-c5-24xl-464 not resumed by ResumeTimeout(600) - marking down and
> power_save
>
>
>
> I keep running into issues where my cloud nodes do not show up in sinfo
> and I can’t display their information with scontrol. This makes it
> difficult to know which of my CLOUD nodes are available for scheduling and
> which are down for some reason and can’t be used. I haven’t figured out
> when slurm will show a cloud node and when it won't and this makes it pretty
> hard to manage the cluster.
>
>
>
> Would I be better off just removing the CLOUD attribute on my EC2 nodes?
> What is the advantage of making them CLOUD nodes if it just makes it more
> difficult to manage the cluster?
>
>
>