[slurm-users] [External] Power saving method selection for different kinds of hardware
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Mar 27 18:32:22 UTC 2023
Hi Prentice,
Since my last message I have figured out a way to implement power_save,
and I've documented our setup in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
This page contains a link to power_save scripts on GitHub.
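
For illustration only (this is not the script from the Wiki or GitHub), a feature-based ResumeProgram could look like the minimal sketch below. It assumes the power_ipmi / power_none / power_azure feature names from my original post. The "<node>-bmc" hostname convention, the omitted IPMI credentials and the placeholder cloud branch are hypothetical and must be adapted to the site.

```shell
#!/bin/bash
# Hypothetical ResumeProgram sketch: choose a power-on method per node
# based on a power_* node feature. Not the script from the Wiki page.

# Map a comma-separated feature list to a power control method name.
power_method() {
    case ",$1," in
        *,power_ipmi,*)  echo ipmi ;;
        *,power_azure,*) echo azure ;;
        *,power_none,*)  echo none ;;
        *)               echo unknown ;;
    esac
}

# Power on a single node using the method selected by its features.
resume_node() {
    node=$1
    # sinfo's %f field prints the node's features; keep one line only,
    # since a node in several partitions is listed once per partition.
    features=$(sinfo -h -n "$node" -o "%f" | head -n 1)
    case "$(power_method "$features")" in
        ipmi)
            # Assumption: the BMC answers as <node>-bmc; the authentication
            # options (-U, -P or -f) are omitted here.
            ipmitool -I lanplus -H "${node}-bmc" chassis power on ;;
        azure)
            # Placeholder: the site-specific cloud CLI call goes here.
            echo "$node: start via cloud CLI (not implemented)" ;;
        none)
            echo "$node: no remote power control, skipping" ;;
        *)
            echo "$node: no power_* feature found" >&2 ;;
    esac
}

# Slurm passes a hostlist expression (e.g. node[001-003]) as argument 1.
if [ "$#" -gt 0 ]; then
    for node in $(scontrol show hostnames "$1"); do
        resume_node "$node"
    done
fi
```

A SuspendProgram would follow the same pattern with "chassis power off" (or the cloud deallocate command) in the corresponding branches.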
Best regards,
Ole
On 27-03-2023 19:35, Prentice Bisbal wrote:
> I'm just catching up on old mailing list messages now. Why not make your
> SuspendProgram and ResumeProgram shell scripts that look at some
> node information in Slurm (look at the features as in your example) or
> some other source (use a case statement based on node names) and call
> the correct suspend/resume command based on that?
>
> I agree that attaching this metadata to the node definition and having
> Slurm act on it directly is the best solution, but in the meantime,
> having a shell script that can figure out the correct way to
> suspend/resume each host should be very doable, if not ideal.
>
> Prentice
>
> On 11/8/22 09:36, Ole Holm Nielsen wrote:
>> I'm thinking about the best way to configure power saving (see
>> https://slurm.schedmd.com/power_save.html) when we have different
>> types of node hardware whose power states have to be managed differently:
>>
>> 1. Nodes with a BMC NIC interface where "ipmitool chassis power ..."
>> commands can be used.
>>
>> 2. Nodes where the BMC cannot be used for powering up due to the
>> shared NICs going down when the node is off :-(
>>
>> 3. Cloud nodes where special cloud CLI commands must be used (such as
>> Azure CLI).
>>
>> The slurm.conf only permits one SuspendProgram and one ResumeProgram
>> which then need to figure out the cases listed above and perform
>> appropriate actions.
>>
>> I was thinking to add a node feature to indicate the kind of power
>> control mechanism available, for example along these lines for the 3
>> above cases:
>>
>> Nodename=node001 Feature=power_ipmi
>> Nodename=node002 Feature=power_none
>> Nodename=node003 Feature=power_azure
>>
>> The node feature could be queried in the SuspendProgram and
>> ResumeProgram, which would then jump to separate branches of the
>> script for the power control commands.
>>
>> Question: Has anyone thought of a similar or better way to handle
>> power saving for different types of nodes?
>>
>> Thanks,
>> Ole
>>
>