<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Hi Alexander:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">This is a great case for using Node Health Check (https://github.com/mej/nhc). With NHC, each node periodically runs an admin-selected set of tests (e.g. "is /work readable?"). A node that fails any test is automatically drained, with the reason
recorded in the node's Reason attribute, and NHC can be set to resume the node (or not) after a later successful test run. We use NHC in this way to keep jobs from starting when there's filesystem trouble. Jobs retain their priority and
pend properly until the nodes report that the filesystem is available again.<o:p></o:p></p>
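<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">To make the idea concrete, here is a minimal sketch of the check-then-drain logic, not NHC's actual code. NHC ships its own checks and handles the drain and resume itself. The function names fs_check and node_action are made up for illustration, and the scontrol commands are only printed, never run.</p>
<pre>
```shell
#!/bin/sh
# Illustration only: NHC provides its own checks and drain logic.
# fs_check and node_action are hypothetical names, not part of NHC or Slurm.

# Succeed only if the directory exists and can be read and traversed.
fs_check() {
    if [ ! -d "$1" ]; then return 1; fi
    if [ ! -r "$1" ]; then return 1; fi
    if [ ! -x "$1" ]; then return 1; fi
    return 0
}

# Print (do not run) the scontrol command that matches the check result.
node_action() {
    node=$1
    dir=$2
    if fs_check "$dir"; then
        echo "scontrol update NodeName=$node State=RESUME"
    else
        echo "scontrol update NodeName=$node State=DRAIN Reason=\"$dir not readable\""
    fi
}
```
</pre>
<p class="MsoNormal">Calling "node_action thin1 /work" on a node prints the drain command if /work is missing or unreadable there, and the resume command otherwise.</p>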
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">As another option, I think you could use Slurm 'licenses' to control dispatch to nodes depending on filesystem availability. For example, give the cluster 99000 licenses named e.g. 'scratch_lic', and use job_submit.lua
(or the users) to make scratch-requesting jobs also request a scratch_lic. You won't need to care how many scratch_lic are available, as long as you start with a number much larger than your maximum possible concurrent job count. All you need to do is
set the count to either zero or that large number, with 'scontrol', whenever you want to disable or enable the launching of the filesystem-using subset of jobs. That could be automated if you had a reliable test that you could run outside of Slurm and that ran 'scontrol'
as needed. You wouldn't have to change any node parameters, and no submissions would be rejected based on filesystem availability (since licenses can't affect job submission, only dispatch).<o:p></o:p></p>
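<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">A rough, untested sketch of how such a gate might look. It assumes slurm.conf defines the pool with "Licenses=scratch_lic:99000" and that affected jobs request one, e.g. "sbatch -L scratch_lic:1". Instead of changing the license count directly, this sketch holds the whole pool in a license-only reservation, which is a swapped-in technique and an assumption on my part, not something described above. The reservation name scratch_gate and the helper gate_cmd are hypothetical. The commands are printed, not executed.</p>
<pre>
```shell
#!/bin/sh
# Sketch of a license "gate". LIC, COUNT, RES and gate_cmd are hypothetical.
# Assumes slurm.conf contains:  Licenses=scratch_lic:99000
# and that filesystem-using jobs request:  sbatch -L scratch_lic:1 ...

LIC=scratch_lic
COUNT=99000
RES=scratch_gate

# Print (do not run) the command that closes or opens the gate.
# close: reserve every license so no pending job can obtain one.
# open:  delete the reservation so the licenses are free again.
gate_cmd() {
    case "$1" in
        close)
            echo "scontrol create reservation ReservationName=$RES Users=root StartTime=now Duration=infinite Flags=LICENSE_ONLY Licenses=$LIC:$COUNT"
            ;;
        open)
            echo "scontrol delete ReservationName=$RES"
            ;;
        *)
            echo "usage: gate_cmd close|open"
            return 2
            ;;
    esac
}
```
</pre>
<p class="MsoNormal">An external filesystem test could call "gate_cmd close" when the filesystem goes away and "gate_cmd open" when it returns, then run the printed command once it has been verified on a test system.</p>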
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I'm sure there could be other solutions. I've not thought further on this since I've been happily using NHC for a long time.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="color:black">== <o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">Paul Brunk, system administrator</span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:black">Georgia Advanced Resource Computing Center<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">Enterprise IT Svcs, the University of Georgia<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-bottom:12.0pt">On 2/1/22, 4:59 AM, "slurm-users" &lt;slurm-users-bounces@lists.schedmd.com&gt; wrote:<o:p></o:p></p>
<div>
<p class="MsoNormal">I hope someone is out there having some experience with the<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">"ActiveFeatures" and "AvailableFeatures" in the node configuration and<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">can give some advice.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">We have configured 4 nodes with certain features, e.g.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">"NodeName=thin1 Arch=x86_64 CoresPerSocket=24<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> CPUAlloc=0 CPUTot=96 CPULoad=44.98<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> AvailableFeatures=work,scratch<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> ActiveFeatures=work,scratch<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">..."<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The features are obviously filesystems mounted. Now we are going to take<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">away one filesystem (work) for maintenance. Therefore we wanted to take<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">away the feature from the nodes. We tried e.g.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"># scontrol update node=thin1 ActiveFeatures="scratch"<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">resulting in<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">"NodeName=thin1 Arch=x86_64 CoresPerSocket=24<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> CPUAlloc=0 CPUTot=96 CPULoad=44.98<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> AvailableFeatures=work,scratch<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> ActiveFeatures=scratch<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">..."<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The problem now is that no jobs can be SUBMITTED requesting the feature<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">work, the error we get is<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">"sbatch: error: Batch job submission failed: Requested node<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">configuration is not available"<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Does this make sense? We want our users to submit jobs requesting<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">features that are available in general because maintenances usually<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">don't last too long and the users want to submit jobs for the time once<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">the feature is available again since we have rather long queuing times.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">I understand that jobs might be rejected when the feature is not<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">available at all but not when it is not active?! Furthermore, also 4<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">node jobs get rejected at submission when the feature is only active on<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">3 nodes. Is this a bug? Wouldn't it make more sense that the job just<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">sits in the queue waiting for the features/resources to be activated again?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Maybe someone has an idea how to handle this problem?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Alexander<o:p></o:p></p>
</div>
</div>
</body>
</html>