[slurm-users] error: power_save module disabled, NULL SuspendProgram
Dr. Thomas Orgis
thomas.orgis at uni-hamburg.de
Wed Mar 29 13:43:22 UTC 2023
On Wed, 29 Mar 2023 14:42:33 +0200,
Ben Polman <Ben.Polman at science.ru.nl> wrote:
> I'd be interested in your kludge; we face a similar situation where the
> slurmctld node does not have access to the IPMI network and cannot ssh
> to machines that have access.
> We are thinking of creating a REST interface to a control server which
> would be running the ipmi commands.
We settled on transient files in /dev/shm on the slurmctld side as the
"API". You could call it an in-memory transactional database ;-)
#!/bin/sh
# node-suspend and node-resume (symlinked) script:
# expands the Slurm hostlist and publishes it as a *.list file in the
# suspend/ or resume/ spool directory, depending on the name this
# script was invoked under.
powerdir=/dev/shm/powersave
# scontrol is expected to sit next to this script
scontrol=$(cd "$(dirname "$0")" && pwd)/scontrol
hostlist=$1
case $0 in
*-suspend)
  subdir=suspend
  ;;
*-resume)
  subdir=resume
  ;;
esac
mkdir -p "$powerdir/$subdir" &&
cd "$powerdir/$subdir" &&
tmp=$(mktemp XXXXXXX.tmp) &&
"$scontrol" show hostnames "$hostlist" > "$tmp" &&
echo "$(date +%Y%m%d-%H%M%S) $(basename "$0") $(tr '\n' ' ' < "$tmp")" >> "$powerdir/log"
# rename into place atomically; the poller only picks up *.list files
mv "$tmp" "${tmp%.tmp}.list"
# end
This atomically creates powersave/suspend/*.list and
powersave/resume/*.list files with node names in them.
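For completeness (the subject's "error: power_save module disabled, NULL
SuspendProgram" shows up when no SuspendProgram is configured at all), such
scripts get hooked into slurmctld via the power_save options in slurm.conf.
A minimal sketch, with the installation path, the base script name and the
timeouts purely assumed for illustration:

# one script, two names:
# ln -s /usr/local/sbin/node-power /usr/local/sbin/node-suspend
# ln -s /usr/local/sbin/node-power /usr/local/sbin/node-resume
# slurm.conf excerpt:
SuspendProgram=/usr/local/sbin/node-suspend
ResumeProgram=/usr/local/sbin/node-resume
SuspendTime=1800
SuspendTimeout=120
ResumeTimeout=600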
On the privileged server, a script periodically looked at these directories
(via ssh) and triggered the appropriate actions, including some heuristics
for unclean shutdowns or spontaneous re-availability (over a thousand runs,
there's a good chance of something getting stuck, even in some driver code).
#!/bin/sh
# Poller on the privileged server: fetch the *.list files that slurmctld
# dropped into /dev/shm/powersave and run the power actions locally.
powerdir=/dev/shm/powersave

# run a command on the batch host (slurmctld side) via ssh
batch()
{
    ssh-wrapper-that-correctly-quotes-argument-list --host=batchhost "$@"
}

while sleep 5
do
    # suspend requests: shut the listed nodes down
    suspendlists=$(batch ls "$powerdir/suspend/" 2>/dev/null | grep '\.list$')
    for f in $suspendlists
    do
        hosts=$(batch cat "$powerdir/suspend/$f" 2>/dev/null)
        for h in $hosts
        do
            case "$h" in
            node*|data*)
                echo "suspending $h"
                node-shutdown-wrapper "$h"
                ;;
            *)
                echo "malformed node name"
                ;;
            esac
        done
        batch rm -f "$powerdir/suspend/$f"
    done

    # resume requests: power the listed nodes back on
    resumelists=$(batch ls "$powerdir/resume/" 2>/dev/null | grep '\.list$')
    for f in $resumelists
    do
        hosts=$(batch cat "$powerdir/resume/$f" 2>/dev/null)
        for h in $hosts
        do
            case "$h" in
            node*)
                echo "resuming $h"
                # Assume the node _should_ be switched off. Ensure that now (in
                # case it hung during shutdown).
                if ipmi-wrapper "$h" chassis power status | grep -q 'on$'; then
                    if ssh -o ConnectTimeout=2 "$h" pgrep slurmd >/dev/null 2>&1 </dev/null; then
                        echo "skipping apparently active node $h"
                    else
                        echo "forcing power reset on $h"
                        ipmi-wrapper "$h" chassis power reset
                    fi
                else
                    ipmi-wrapper "$h" chassis power on
                fi
                # Wait to make sure?
                ;;
            *)
                echo "malformed node name"
                ;;
            esac
        done
        batch rm -f "$powerdir/resume/$f"
    done
done
# end
The current approach handles resume better, waiting for a number of
hosts at the same time and only un-draining those that reappeared.
Back then, we relied on the nodes being automatically re-incorporated by
slurmctld. That mostly worked, but not always, resulting in spurious
NODE_FAILs which started to annoy users.
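In case it helps as a starting point, a minimal sketch of that
wait-and-undrain step (not our actual current script): it assumes the same
ssh/pgrep probe as above, routes scontrol through the batch host via the
same batch() helper, and uses a placeholder ten-minute boot budget.

#!/bin/sh
# wait-and-undrain sketch: give freshly powered-on hosts some boot time,
# then clear the drain/fail state only for those that answer again.
# batch() is the same ssh helper as in the poller above.
batch()
{
    ssh-wrapper-that-correctly-quotes-argument-list --host=batchhost "$@"
}
hosts=$*
deadline=$(( $(date +%s) + 600 ))   # ten-minute boot budget (placeholder)
while [ -n "$hosts" ] && [ "$(date +%s)" -lt "$deadline" ]
do
    remaining=
    for h in $hosts
    do
        if ssh -o ConnectTimeout=2 "$h" pgrep slurmd >/dev/null 2>&1 </dev/null; then
            echo "un-draining $h"
            batch scontrol update NodeName="$h" State=RESUME
        else
            remaining="$remaining $h"
        fi
    done
    hosts=$remaining
    sleep 10
done
[ -n "$hosts" ] && echo "these never came back:$hosts"
# end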
Alrighty then,
Thomas
--
Dr. Thomas Orgis
HPC @ Universität Hamburg