[slurm-users] ActiveFeatures job submission
Paul Brunk
pbrunk at uga.edu
Thu Feb 10 04:27:40 UTC 2022
Hi Alexander:
This is a great case for using Node Health Check (NHC, https://github.com/mej/nhc). With NHC, each node periodically runs an admin-selected set of tests (e.g. "is /work readable?"). A node that fails any test is automatically drained, with the failure recorded in the node's Reason attribute, and NHC can optionally be set to resume the node after a later successful test run. We use NHC in this way to keep jobs from starting when there's filesystem trouble. Jobs retain their priority and pend properly until the nodes report that the filesystem is available again.
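For illustration, a filesystem test of this kind might look like the nhc.conf line below. This is only a sketch: it assumes NHC's built-in check_fs_mount_rw check and a /work mount point, and it is not the configuration of any particular site.

```text
# Hypothetical nhc.conf entry: on all nodes ("*"), fail the health check
# unless /work is mounted read-write.
 * || check_fs_mount_rw -f /work
```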
As another option, I think you could use Slurm 'licenses' to control dispatch to nodes depending on filesystem availability. For example, give the cluster 99000 licenses of a type called e.g. 'scratch_lic', and use job_submit.lua (or rely on the users) to make every scratch-requesting job also request a scratch_lic. You don't need to care how many scratch_lic are available, as long as the starting count is much larger than your maximum possible concurrent job count. All you need to do is set the count to either zero or that large number with 'scontrol' whenever you want to disable or enable the launching of the filesystem-using subset of jobs. That could be automated if you had a reliable test that runs outside of Slurm and calls 'scontrol' as needed. You wouldn't have to change any node parameters, and no submissions would be rejected based on filesystem availability (since licenses can't affect job submission, only dispatch).
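A minimal sketch of the "reliable test run outside of Slurm" part of the license idea follows. The function name desired_lic_count and the variable FULL_COUNT are hypothetical, and the sketch only decides which count should be in effect. It deliberately does not show the 'scontrol' call that applies the count, since the exact command is not given here.

```shell
#!/bin/sh
# Sketch only: all names here are hypothetical, not taken from a real site config.
FULL_COUNT=99000   # the "much larger than any concurrent job count" value

# Print the license count that should be in effect for the given mount point:
# FULL_COUNT if the directory exists and can be listed, 0 otherwise.
desired_lic_count() {
    if [ -d "$1" ] && ls "$1" >/dev/null 2>&1; then
        echo "$FULL_COUNT"
    else
        echo 0
    fi
}

# Users (or job_submit.lua on their behalf) would request one license per job:
#   sbatch --licenses=scratch_lic:1 job.sh
# Current license counts can be inspected with:
#   scontrol show licenses
```

For example, `desired_lic_count /scratch` prints 99000 while /scratch can be listed and 0 otherwise. A hung network mount can block `ls` indefinitely, so a real probe would probably want a timeout around it.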
I'm sure there could be other solutions. I've not thought further on this since I've been happily using NHC for a long time.
==
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia
On 2/1/22, 4:59 AM, "slurm-users" <slurm-users-bounces at lists.schedmd.com> wrote:
I hope someone out there has some experience with the
"ActiveFeatures" and "AvailableFeatures" in the node configuration and
can give some advice.
We have configured 4 nodes with certain features, e.g.
"NodeName=thin1 Arch=x86_64 CoresPerSocket=24
CPUAlloc=0 CPUTot=96 CPULoad=44.98
AvailableFeatures=work,scratch
ActiveFeatures=work,scratch
..."
The features obviously correspond to mounted filesystems. Now we are going
to take one filesystem (work) offline for maintenance. Therefore we wanted
to remove the feature from the nodes. We tried e.g.
# scontrol update node=thin1 ActiveFeatures="scratch"
resulting in
"NodeName=thin1 Arch=x86_64 CoresPerSocket=24
CPUAlloc=0 CPUTot=96 CPULoad=44.98
AvailableFeatures=work,scratch
ActiveFeatures=scratch
..."
The problem now is that no jobs can be SUBMITTED requesting the feature
work. The error we get is
"sbatch: error: Batch job submission failed: Requested node
configuration is not available"
Does this make sense? We want our users to be able to submit jobs
requesting features that are available in general, because maintenance
windows usually don't last long. Since we have rather long queuing times,
users want to submit jobs now so that they can run once the feature is
available again.
I understand that jobs might be rejected when the feature is not
available at all, but not when it is merely not active. Furthermore,
4-node jobs are also rejected at submission when the feature is active on
only 3 nodes. Is this a bug? Wouldn't it make more sense for the job to
just sit in the queue waiting for the features/resources to be activated
again?
Maybe someone has an idea how to handle this problem?
Thanks,
Alexander