[slurm-users] Change ExcNodeList on a running job

Rodrigo Santibáñez rsantibanez.uchile at gmail.com
Thu Jun 4 20:15:44 UTC 2020


Hello,

Jobs can be requeued if something goes wrong, and the node with the failure
is excluded by the controller.

*--requeue* Specifies that the batch job should be eligible for requeuing.
The job may be requeued explicitly by a system administrator, after node
failure, or upon preemption by a higher priority job. When a job is
requeued, the batch script is initiated from its beginning. Also see the
*--no-requeue* option. The *JobRequeue* configuration parameter controls
the default behavior on the cluster.
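
As a rough sketch of how a batch script might use this, the fragment below
requests requeue eligibility and checks how many times the job has already
been restarted. It relies on the SLURM_RESTART_COUNT environment variable
that Slurm sets for restarted or requeued jobs; the MAX_REQUEUES limit, the
job name, and the too_many_requeues helper are assumptions made up for this
example, not part of Slurm.

```shell
#!/bin/bash
#SBATCH --requeue                 # make the job eligible for requeuing
#SBATCH --job-name=requeue-demo   # hypothetical job name

# MAX_REQUEUES is a hypothetical limit chosen for this sketch.
MAX_REQUEUES=${MAX_REQUEUES:-3}

# Succeeds (returns 0) when the job has been restarted more than
# MAX_REQUEUES times. SLURM_RESTART_COUNT is only set once the job has
# been restarted or requeued, so it is treated as 0 when unset.
too_many_requeues() {
    local count=${SLURM_RESTART_COUNT:-0}
    [ "$count" -gt "$MAX_REQUEUES" ]
}

# Possible use near the top of the job script:
#   if too_many_requeues; then
#       echo "giving up after ${SLURM_RESTART_COUNT} restarts" >&2
#       exit 1
#   fi
```

Keeping the check in a small function makes it easy to try outside of Slurm
by setting SLURM_RESTART_COUNT by hand.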

Also, jobs can be run on specific nodes or with certain nodes excluded:

*-w*, *--nodelist*=<*node name list*> Request a specific list of hosts. The
job will contain *all* of these hosts and possibly additional hosts as
needed to satisfy resource requirements. The list may be specified as a
comma-separated list of hosts, a range of hosts (host[1-5,7,...] for
example), or a filename. The host list will be assumed to be a filename if
it contains a "/" character. If you specify a minimum node or processor
count larger than can be satisfied by the supplied host list, additional
resources will be allocated on other nodes as needed. Duplicate node names
in the list will be ignored. The order of the node names in the list is not
important; the node names will be sorted by Slurm.

*-x*, *--exclude*=<*node name list*> Explicitly exclude certain nodes from
the resources granted to the job.
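
One possible way to combine these options, given that ExcNodeList may not be
changeable on a running job, is to have the job build a new exclude list and
submit a fresh copy of itself with --exclude. The helper below is only a
sketch: add_to_exclude_list, the EXCLUDE variable, and job.sh are hypothetical
names, and the duplicate check handles plain comma-separated node names only,
not range expressions such as host[1-5].

```shell
#!/bin/bash

# Append a node name to a comma-separated exclude list.
# Prints the list unchanged if the node is empty or already present.
add_to_exclude_list() {
    local list=$1 node=$2
    if [ -z "$node" ]; then
        printf '%s\n' "$list"
        return
    fi
    case ",$list," in
        *",$node,"*) printf '%s\n' "$list" ;;            # already excluded
        *)           printf '%s\n' "${list:+$list,}$node" ;;
    esac
}

# Possible use inside a job that detected a problem on its node
# (SLURMD_NODENAME is set by Slurm to the node running the script;
# EXCLUDE and job.sh are assumptions for this sketch):
#   EXCLUDE=$(add_to_exclude_list "$EXCLUDE" "$SLURMD_NODENAME")
#   sbatch --requeue --exclude="$EXCLUDE" --export=ALL,EXCLUDE="$EXCLUDE" job.sh
```

Passing the list along through the environment lets each resubmitted job see
which nodes earlier attempts already ruled out.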

Does this help?

On Thu, Jun 4, 2020 at 16:03, Ransom, Geoffrey M. (<
Geoffrey.Ransom at jhuapl.edu>) wrote:

>
>
> Hello
>
>    We are moving from Univa (SGE) to Slurm, and one of our users has jobs
> that, if they detect a failure on the current machine, add that machine
> to their exclude list and requeue themselves. The user wants to emulate
> that behavior in Slurm.
>
>
>
> It seems like “scontrol update JobId=${SLURM_JOB_ID}
> ExcNodeList=$NEWExcNodeList” won’t work on a running job, but it does work
> on a job pending in the queue. This means the job can’t do this step and
> requeue itself to avoid running on the same host as before.
>
>
>
> Our user wants his jobs to be able to exclude the current node and requeue
> themselves.
>
> Is there some way to accomplish this in slurm?
>
> Is there a requeue counter of some sort so a job can see if it has
> requeued itself more than X times and give up?
>
>
>
> Thanks.
>