Dear Slurm users,
In our project, we exclude the master node from computing before starting slurmctld. We used to do this by simply not mentioning the master in the configuration, i.e. by not having a line like:
PartitionName=SomePartition Nodes=master
or something similar. Apparently this is no longer the way to do it, as it now causes a fatal error:
fatal: Unable to determine this slurmd's NodeName
Therefore, my *question:*
What is the best practice for excluding the master node from work?
I mainly see the options of setting the node to DOWN, DRAINED or RESERVED. Since we use ReturnToService=2, I guess DOWN is not the way to go. RESERVED fits the second part of its description, "The node is in an advanced reservation and *not generally available*.", while the description of DRAINED, "The node is unavailable for use per system administrator request.", fits completely. So is *DRAINED* the correct setting in such a case?
Best regards, Xaver
Dear Xaver,
We have a similar setup, and yes, we set the node to "state=DRAIN". Slurm keeps it that way until you manually change it, e.g. to "state=RESUME".
Regards, Hermann
On 6/24/24 13:54, Xaver Stiensmeier via slurm-users wrote:
> Dear Slurm users,
> In our project, we exclude the master node from computing before starting slurmctld. We used to do this by simply not mentioning the master in the configuration, i.e. by not having a line like:
> PartitionName=SomePartition Nodes=master
> or something similar. Apparently this is no longer the way to do it, as it now causes a fatal error:
> fatal: Unable to determine this slurmd's NodeName
> Therefore, my *question:*
> What is the best practice for excluding the master node from work?
> I mainly see the options of setting the node to DOWN, DRAINED or RESERVED. Since we use ReturnToService=2, I guess DOWN is not the way to go. RESERVED fits the second part of its description, "The node is in an advanced reservation and *not generally available*.", while the description of DRAINED, "The node is unavailable for use per system administrator request.", fits completely. So is *DRAINED* the correct setting in such a case?
> Best regards, Xaver
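Hermann's drain approach above is done with scontrol at run time; a minimal sketch, assuming the node is actually declared in slurm.conf under the name "master" (the node name and Reason text are placeholders):

```shell
# Mark the node as unavailable for new work; a Reason string is required
# when draining. Slurm keeps this state until an admin changes it.
scontrol update NodeName=master State=DRAIN Reason="excluded from compute"

# Verify: the node should now report state "drained" (or "draining"
# while jobs finish).
sinfo -n master -o "%N %T"

# To put it back into service later:
scontrol update NodeName=master State=RESUME
```

Note that this only works for a node that is defined in the configuration and running slurmd.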
On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote:
> Dear Slurm users,
> In our project, we exclude the master node from computing before starting slurmctld. We used to do this by simply not mentioning the master in the configuration, i.e. by not having a line like:
> PartitionName=SomePartition Nodes=master
> or something similar. Apparently this is no longer the way to do it, as it now causes a fatal error:
> fatal: Unable to determine this slurmd's NodeName

You're attempting to start slurmd, which, as you say, isn't required on this machine. Disable it. Keep slurmctld enabled (and declared in the config).

> Therefore, my *question:*
> What is the best practice for excluding the master node from work?

Not defining it as a worker node.

> I mainly see the options of setting the node to DOWN, DRAINED or RESERVED.

These states are slurmd states, and therefore meaningless for a machine that doesn't have a running slurmd. (It's the nodes that are defined in the config that are supposed to be able to run slurmd.)

> So is *DRAINED* the correct setting in such a case?

Since this only applies to a node that has been defined in the config, and you (correctly) didn't do so, there is no need (and no means) to "drain" it.

Best,
Steffen
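Steffen's advice ("disable it") translates to simply not running the slurmd service on the controller. A sketch for a systemd-based host, assuming the standard Slurm unit names:

```shell
# On the controller only: stop slurmd and keep it from starting at boot.
systemctl disable --now slurmd

# The controller daemon stays enabled (add slurmdbd here if accounting
# runs on the same host).
systemctl enable --now slurmctld
```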
Thanks, Steffen,
That makes a lot of sense. I will simply not start slurmd in the master's Ansible role when the master is not to be used for computing.
Best regards, Xaver
On 24.06.24 14:23, Steffen Grunewald via slurm-users wrote:
> On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote:
>> Dear Slurm users,
>> In our project, we exclude the master node from computing before starting slurmctld. We used to do this by simply not mentioning the master in the configuration, i.e. by not having a line like:
>> PartitionName=SomePartition Nodes=master
>> or something similar. Apparently this is no longer the way to do it, as it now causes a fatal error:
>> fatal: Unable to determine this slurmd's NodeName
>
> You're attempting to start slurmd, which, as you say, isn't required on this machine. Disable it. Keep slurmctld enabled (and declared in the config).
>
>> Therefore, my *question:*
>> What is the best practice for excluding the master node from work?
>
> Not defining it as a worker node.
>
>> I mainly see the options of setting the node to DOWN, DRAINED or RESERVED.
>
> These states are slurmd states, and therefore meaningless for a machine that doesn't have a running slurmd. (It's the nodes that are defined in the config that are supposed to be able to run slurmd.)
>
>> So is *DRAINED* the correct setting in such a case?
>
> Since this only applies to a node that has been defined in the config, and you (correctly) didn't do so, there is no need (and no means) to "drain" it.
>
> Best,
> Steffen
Hi Xaver,
Xaver Stiensmeier via slurm-users <slurm-users@lists.schedmd.com> writes:
> Dear Slurm users,
> In our project, we exclude the master node from computing before starting slurmctld. We used to do this by simply not mentioning the master in the configuration, i.e. by not having a line like:
> PartitionName=SomePartition Nodes=master
> or something similar. Apparently this is no longer the way to do it, as it now causes a fatal error:
> fatal: Unable to determine this slurmd's NodeName
> Therefore, my question:
> What is the best practice for excluding the master node from work?
> I mainly see the options of setting the node to DOWN, DRAINED or RESERVED. Since we use ReturnToService=2, I guess DOWN is not the way to go. RESERVED fits the second part of its description, "The node is in an advanced reservation and not generally available.", while the description of DRAINED, "The node is unavailable for use per system administrator request.", fits completely. So is DRAINED the correct setting in such a case?
You just don't configure the head node in any partition.
You are getting the error because you are starting 'slurmd' on the node, which implies you do want to run jobs there. Normally you would run only 'slurmctld' and possibly also 'slurmdbd' on your head node.
Cheers,
Loris
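To make Loris's point concrete, a minimal slurm.conf fragment for this layout declares the head node only as the controller and lists just the real compute nodes; all hostnames and the CPU count below are placeholders:

```ini
# Head node: runs slurmctld (and possibly slurmdbd), but no slurmd,
# so it is never declared with a NodeName line.
SlurmctldHost=head

ReturnToService=2

# Only actual compute nodes appear below; slurmd runs on these.
NodeName=node[01-04] CPUs=8 State=UNKNOWN
PartitionName=SomePartition Nodes=node[01-04] Default=YES State=UP
```

Since the head node is absent from every NodeName and partition line, no slurmd ever has to resolve its own NodeName there.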
Dear Xaver,
Could you clarify the function of what you call "master"?
If it's the Slurm controller, i.e. running slurmctld: Why do you need slurmd running on it as well?
Best, Stephan
On 24.06.24 13:54, Xaver Stiensmeier via slurm-users wrote:
> Dear Slurm users,
> In our project, we exclude the master node from computing before starting slurmctld. We used to do this by simply not mentioning the master in the configuration, i.e. by not having a line like:
> PartitionName=SomePartition Nodes=master
> or something similar. Apparently this is no longer the way to do it, as it now causes a fatal error:
> fatal: Unable to determine this slurmd's NodeName
> Therefore, my *question:*
> What is the best practice for excluding the master node from work?
> I mainly see the options of setting the node to DOWN, DRAINED or RESERVED. Since we use ReturnToService=2, I guess DOWN is not the way to go. RESERVED fits the second part of its description, "The node is in an advanced reservation and *not generally available*.", while the description of DRAINED, "The node is unavailable for use per system administrator request.", fits completely. So is *DRAINED* the correct setting in such a case?
> Best regards, Xaver