Hi Xaver,
On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote:
> I would like to start up all ~idle (idle and powered down) nodes and check programmatically whether they all came up as expected. For context: this is for a program that sets up Slurm clusters with on-demand cloud scheduling.
> In its simplest form, this could be a command like *srun FORALL hostname*, which would return the names of the nodes if it succeeds and an error message otherwise. However, as far as I am aware there is no input value like FORALL. One could use -N{total node number}, since all nodes are ~idle when this executes, but I don't know an easy way to get the total number of nodes.
Good documentation exists for this, and I recommend starting with the Slurm Power Saving Guide (https://slurm.schedmd.com/power_save.html).
Once you have developed a method to power up your cloud nodes, the slurmd daemons will register with slurmctld when they start. Simply running the "sinfo" command will tell you which nodes are up (idle) and which are still in a powered-down state (idle~).
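A minimal sketch of how the sinfo output could be summarised programmatically. It assumes one "name state" pair per line, as printed by sinfo -h -N -o '%N %t', and that every node is expected to be plain "idle" at this point. The function name summarize_nodes is made up for illustration.

```shell
#!/bin/sh
# Summarise node states from "name state" lines on stdin.
# A trailing "~" on the state marks a powered-down node; any state that is
# neither plain "idle" nor powered down (e.g. "idle#", "down*") is listed
# for manual inspection.
summarize_nodes() {
    # sort -u: a node in several partitions is printed once per partition
    sort -u | awk '
        $2 == "idle" { up++;    next }
        $2 ~ /~$/    { down++;  next }
                     { other++; names = names " " $1 "(" $2 ")" }
        END {
            printf "up=%d powered_down=%d other=%d\n", up, down, other
            if (other) printf "check:%s\n", names
        }'
}
```

On a machine with Slurm installed it would be fed like this: sinfo -h -N -o '%N %t' | summarize_nodes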
When slurmd starts up, it calls the HealthCheckProgram defined in slurm.conf to verify that the node is healthy. Configuring a HealthCheckProgram is strongly recommended. The slurmd won't start if the HealthCheckProgram reports a faulty status, and you'll need to check such nodes manually.
So there should not be any need to execute commands on the nodes.
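To check programmatically that all nodes came up, one option is to poll sinfo from the controller until every node reports plain "idle". The following is only a sketch under that assumption. The names list_nodes, all_idle and wait_for_nodes are hypothetical, and the default timeout and interval are arbitrary values that should be adapted to how long the cloud instances take to boot.

```shell
#!/bin/sh
# Print one "name state" line per node. Kept as a separate function so it
# can be replaced when trying out the loop without a Slurm installation.
list_nodes() {
    sinfo -h -N -o '%N %t'
}

# Succeed only if stdin is non-empty and every state is exactly "idle".
all_idle() {
    awk '$2 != "idle" { bad = 1 } END { exit (bad || NR == 0) }'
}

# Poll until all nodes are idle.
# $1: timeout in seconds (default 600), $2: poll interval (default 15).
# Returns 0 when all nodes are idle, 1 when the timeout expires first.
wait_for_nodes() {
    timeout=${1:-600}
    interval=${2:-15}
    elapsed=0
    until list_nodes | all_idle; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}
```

After requesting power-up (for example with scontrol update NodeName=<nodelist> State=POWER_UP), calling wait_for_nodes 900 30 would wait up to 15 minutes. A non-zero return status means at least one node is still powered down, powering up, or in some other state that needs a look.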
If you wish, you can still run a command on all "idle" nodes, for example using ClusterShell [1]:
$ clush -bw@slurmstate:idle uname -r
Best regards, Ole
[1] The Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#clustershell shows example usage of ClusterShell