I've updated slurm.conf to increase the controller debug level:
> SlurmctldDebug=5

I now see additional log output (debug).

[2021-07-30T22:42:05.255] debug:  Spawning ping agent for
[2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not responding, setting DOWN

The output is still very sparse, but it looks like Slurm is trying to ping
nodes that have already been removed (they no longer exist, because the
slurm_suspend.sh script removes them).

I tried sinfo -R, but it doesn't give much information:

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2021-07-30T22:42:05

These machines are gone, so they should not respond:

$ ping slurm4-compute9
ping: slurm4-compute9: Name or service not known

This is expected.

Why does Slurm keep trying to contact nodes that have already been removed?
slurm_suspend.sh does the following to "remove" a node from the partition:
> scontrol update nodename=${host} nodeaddr="(null)"
Maybe this isn't the correct way to do it? Is there a way to force Slurm to
forget about the node? I tried "scontrol update node=$node state=idle", but
this only works for a few minutes, until Slurm's ping agent kicks in and
marks the nodes DOWN again.
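
For context, here is a minimal sketch of the shape of such a suspend script.
It is not the actual slurm_suspend.sh. It assumes Slurm passes the nodes to
suspend as a hostlist expression in the first argument (as it does for
SuspendProgram), and the function name suspend_nodes and the "deleting
instance" line are hypothetical placeholders for the site-specific step that
destroys the machine.

```shell
#!/bin/bash
# Hypothetical sketch of a SuspendProgram; NOT the real slurm_suspend.sh.
# Assumption: Slurm calls the script with a hostlist expression as $1,
# e.g. "slurm4-compute[9,15,19-22,30]".

suspend_nodes() {
    local hostlist="$1"
    local host
    # "scontrol show hostnames" expands the expression, one node name per line.
    for host in $(scontrol show hostnames "$hostlist"); do
        # Placeholder for whatever actually deletes the machine (site-specific).
        echo "deleting instance ${host}"
        # The update quoted above: clear the address slurmctld holds for the node.
        scontrol update nodename="${host}" nodeaddr="(null)"
    done
}

# Only act when the script is invoked with an argument.
if [ "$#" -gt 0 ]; then
    suspend_nodes "$1"
fi
```

Note that nothing in this sketch changes the node's state in slurmctld; it only
clears the address, which may be why the ping agent still treats the nodes as
something it should be able to reach.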


