[slurm-users] Change ExcNodeList on a running job
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Jun 5 07:41:18 UTC 2020
Hi Geoffrey,
I'm just curious as to what causes a user to decide that a given node
has an issue? If a node is healthy in all respects, why would a user
decide not to use the node?
We can certainly perform all sorts of node health checks from Slurm by
configuring the use of LBNL Node Health Check[1]. Items such as disk
full, network interface down, memory removed, etc. can be checked for.
Slurm will offline any node that fails a NHC check, and no jobs will be
started on that node until the condition has been cleared.
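As a rough illustration, the slurm.conf pieces that hook in NHC might look
like this (the path and interval are site-specific placeholders):

   HealthCheckProgram=/usr/sbin/nhc
   HealthCheckInterval=300
   HealthCheckNodeState=ANY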
I have some suggestions about NHC usage in my Slurm Wiki:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check
Best regards,
Ole
[1] https://github.com/mej/nhc
On 04-06-2020 23:37, Ransom, Geoffrey M. wrote:
> Not quite.
>
> The user’s job script in question checks the exit status of the
> program it runs. If the program fails, the running job wants to exclude
> the machine it is currently running on and requeue itself, in case the
> failure was due to a local machine issue that the scheduler has not
> flagged as a problem.
>
> The current goal is to have a running job step in an array job add the
> current host to its exclude list and requeue itself when it detects a
> problem. I can’t seem to modify the exclude list while a job is running,
> but once the task is requeued and back in the queue it is no longer
> running, so it can’t modify its own exclude list.
>
> I.e., put something like the following into an sbatch script so each
> task can run it against itself:
>
> if ! $runprogram $args ; then
>     NewExcNodeList="${ExcNodeList},${HOSTNAME}"
>     scontrol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList
>     scontrol requeue ${SLURM_JOB_ID}
>     sleep 10
> fi
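>
> (As written, this assumes $ExcNodeList already holds the job’s current
> exclusion list, which as far as I know Slurm does not export as an
> environment variable. A rough sketch of one way to populate it inside the
> batch script, by parsing the ExcNodeList field that scontrol show job
> prints, might be:
>
> ExcNodeList=$(scontrol show job ${SLURM_JOB_ID} | tr ' ' '\n' |
>     awk -F= '/^ExcNodeList=/{print $2}')
> # scontrol prints "(null)" when no nodes are excluded yet
> [ "$ExcNodeList" = "(null)" ] && ExcNodeList=""
>
> The exact parsing is illustrative only.)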
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of*
> Rodrigo Santibáñez
> *Sent:* Thursday, June 4, 2020 4:16 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [EXT] Re: [slurm-users] Change ExcNodeList on a running job
>
>
> Hello,
>
> Jobs can be requeued if something goes wrong, and the node with the
> failure excluded by the controller.
>
> *--requeue*
>
> Specifies that the batch job should be eligible for requeuing. The job
> may be requeued explicitly by a system administrator, after node
> failure, or upon preemption by a higher priority job. When a job is
> requeued, the batch script is initiated from its beginning. Also see the
> *--no-requeue* option. The /JobRequeue/ configuration parameter controls
> the default behavior on the cluster.
>
> Also, jobs can be run selecting a specific node or excluding nodes
>
> *-w*, *--nodelist*=<node name list>
>
> Request a specific list of hosts. The job will contain /all/ of these
> hosts and possibly additional hosts as needed to satisfy resource
> requirements. The list may be specified as a comma-separated list of
> hosts, a range of hosts (host[1-5,7,...] for example), or a filename.
> The host list will be assumed to be a filename if it contains a "/"
> character. If you specify a minimum node or processor count larger than
> can be satisfied by the supplied host list, additional resources will be
> allocated on other nodes as needed. Duplicate node names in the list
> will be ignored. The order of the node names in the list is not
> important; the node names will be sorted by Slurm.
>
> *-x*, *--exclude*=<node name list>
>
> Explicitly exclude certain nodes from the resources granted to the job.
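>
> As a rough sketch of how these options fit together in a batch script
> (the node names and program are only placeholders):
>
> #!/bin/bash
> #SBATCH --requeue
> #SBATCH --exclude=node[001-002]
> srun ./my_program
>
> The exclude list given this way is fixed at submission time.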
>
> does this help?
>
> On Thu, Jun 4, 2020 at 16:03, Ransom, Geoffrey M.
> (Geoffrey.Ransom at jhuapl.edu) wrote:
>
> Hello
>
> We are moving from Univa (SGE) to Slurm, and one of our users has
> jobs that, if they detect a failure on the current machine, add that
> machine to their exclude list and requeue themselves. The user
> wants to emulate that behavior in Slurm.
>
> It seems like “scontrol update job ${SLURM_JOB_ID}
> ExcNodeList=$NEWExcNodeList” won’t work on a running job, but it does
> work on a job pending in the queue. This means the job can’t do this
> step and requeue itself to avoid running on the same host as before.
>
> Our user wants his jobs to be able to exclude the current node and
> requeue themselves.
>
> Is there some way to accomplish this in slurm?
>
> Is there a requeue counter of some sort so a job can see if it has
> requeued itself more than X times and give up?
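Regarding the requeue counter: a minimal sketch, assuming the
SLURM_RESTART_COUNT environment variable (described in the sbatch man page
for restarted/requeued jobs) is set in the job’s environment, could be:

   if [ "${SLURM_RESTART_COUNT:-0}" -ge 3 ]; then
       echo "Giving up after ${SLURM_RESTART_COUNT} requeues" >&2
       exit 1
   fi

The limit of 3 is arbitrary.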