[slurm-users] update_node / reason set to: slurm.conf / state set to DRAINED

Kevin Buckley Kevin.Buckley at pawsey.org.au
Mon Nov 9 00:47:13 UTC 2020


On 2020/11/05 17:15, Zacarias Benta wrote:
> On 05/11/2020 02:00, Kevin Buckley wrote:
>>
>> We have had a couple of nodes enter a DRAINED state where scontrol
>> gives the reason as
>>
>> Reason=slurm.conf
>
> Hi Kevin,
> 
> I have no experience with version 20 of slurm, but probably you have
> some misconfiguration.
> Have you changed any settings in your slurm.conf file after the upgrade?
> Dive into the documentation and verify if there aren't any changes to
> some of the directives within the slurm.conf.

That was my initial thinking, Zacarias, as it was that of our on-site
Cray engineer, and an understandable line of thinking too, given that
we had just upgraded our Slurm version.

The "less clear" part of this was why, when every node within a Cray
gets the same OS image, and so the same Slurm deployment, only a
small number of nodes inside the Cray had entered a state with that
"slurm.conf" reason.

Add to that the fact that the only messages we typically see regarding
slurm.conf appear when the slurmctld has a mismatch with the compute
nodes, and so again, coming just after a redeployment, it was fairly
easy to tick that one off.

On 2020/11/06 05:11, Christopher Samuel wrote:
>
> ... I took a quick look at the 
> source last night and couldn't see anything that looked related, and 
> it's not a message I remember seeing before.

Indeed.

As my own investigation of the code initially suggested, and as SchedMD's
Marcin Stolarek has since confirmed (SchedMD ticket 10143), there is no
path through the Slurm codebase that would give rise to a "reason" string
of "slurm.conf", and so it has clearly been entered into the system from
outside of Slurm itself.
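For what it's worth, the drain reason is simply whatever free-text string an operator supplies to `scontrol`, recorded verbatim, which is how an arbitrary value like "slurm.conf" can end up there. A minimal sketch (the node name is hypothetical):

```shell
# The Reason string is recorded verbatim; Slurm does not generate it.
# nid00042 is a hypothetical node name used for illustration.
scontrol update NodeName=nid00042 State=DRAIN Reason="slurm.conf"

# The same string then comes back in the node's status output:
scontrol show node nid00042 | grep -i reason
```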

The current thinking here is that someone was working on an issue that
had not yet been assigned a ticket ID in our issue-tracking system;
ordinarily the ticket ID would have been entered as the "reason", but
in its absence a rather vague "slurm.conf" string was supplied instead.

If anything, I have been further into the Slurm codebase than before,
so that's been an education in itself.

Thanks for your thoughts around this,
Kevin
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
