[slurm-users] Change ExcNodeList on a running job

Rodrigo Santibáñez rsantibanez.uchile at gmail.com
Thu Jun 4 21:50:31 UTC 2020


What about, instead of an (automatic) requeue of the job, using --no-requeue in
the first sbatch, and when something goes wrong with the job (why not
something wrong with the node?), submitting the job again with --no-requeue
and the failing node excluded?

Something like: sbatch --no-requeue file.sh, and then sbatch --no-requeue
--exclude=n001 file.sh (options on the command line override the options
inside the script).
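
As an untested sketch of that idea (EXCLUDED, MAX_TRIES, runprogram, and args
are illustrative placeholder names, not Slurm options, and file.sh stands for
the same submission script as in the example above):

#!/bin/bash
#SBATCH --no-requeue

# Comma-separated list of nodes to avoid, carried over from the previous submission
EXCLUDED=${EXCLUDED:-}
# Give up after this many resubmissions
MAX_TRIES=${MAX_TRIES:-3}

if ! $runprogram $args ; then
    # Add the current host and resubmit the same script with it excluded.
    EXCLUDED=${EXCLUDED:+$EXCLUDED,}$HOSTNAME
    if [ "$MAX_TRIES" -gt 0 ]; then
        # Exported variables travel with the new job under sbatch's default
        # --export=ALL, and command-line options override the #SBATCH options
        # inside the script.
        export EXCLUDED
        export MAX_TRIES=$((MAX_TRIES - 1))
        sbatch --no-requeue --exclude="$EXCLUDED" file.sh
    fi
fi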

On Thu, Jun 4, 2020 at 5:40 PM, Ransom, Geoffrey M. (<
Geoffrey.Ransom at jhuapl.edu>) wrote:

>
>
> Not quite.
>
> The user’s job script in question checks the error status of the program it
> ran, while the job is still running. If the program fails, the running job
> wants to exclude the machine it is currently running on and requeue itself,
> in case it died due to a local machine issue that the scheduler has not
> flagged as a problem.
>
>
>
> The current goal is to have a running job step in an array job add the
> current host to its exclude list and requeue itself when it detects a
> problem. I can’t seem to modify the exclude list while a job is running,
> but once the task is requeued and back in the queue it is no longer running
> so it can’t modify its own exclude list.
>
>
>
> I.e., put something like the following into an sbatch script so each task
> can run it against itself.
>
>
>
> if ! $runprogram $args ; then
>   NewExcNodeList="$ExcNodeList,$HOSTNAME"
>   scontrol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList
>   scontrol requeue ${SLURM_JOB_ID}
>   sleep 10
> fi
>
>
>
>
>
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> Rodrigo Santibáñez
> Sent: Thursday, June 4, 2020 4:16 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: [EXT] Re: [slurm-users] Change ExcNodeList on a running job
>
>
> Hello,
>
>
>
> Jobs can be requeued if something goes wrong, and the failing node can be
> excluded by the controller.
>
>
>
> --requeue
>
> Specifies that the batch job should be eligible for being requeued. The job
> may be requeued explicitly by a system administrator, after node failure, or
> upon preemption by a higher priority job. When a job is requeued, the batch
> script is initiated from its beginning. Also see the --no-requeue option.
> The JobRequeue configuration parameter controls the default behavior on the
> cluster.
>
>
>
> Also, jobs can be run selecting specific nodes or excluding nodes:
>
>
>
> -w, --nodelist=<node name list>
>
> Request a specific list of hosts. The job will contain all of these
> hosts and possibly additional hosts as needed to satisfy resource
> requirements. The list may be specified as a comma-separated list of hosts,
> a range of hosts (host[1-5,7,...] for example), or a filename. The host
> list will be assumed to be a filename if it contains a "/" character. If
> you specify a minimum node or processor count larger than can be satisfied
> by the supplied host list, additional resources will be allocated on other
> nodes as needed. Duplicate node names in the list will be ignored. The
> order of the node names in the list is not important; the node names will
> be sorted by Slurm.
>
>
>
> -x, --exclude=<node name list>
>
> Explicitly exclude certain nodes from the resources granted to the job.
>
>
>
> does this help?
>
>
>
> On Thu, Jun 4, 2020 at 4:03 PM, Ransom, Geoffrey M. (<
> Geoffrey.Ransom at jhuapl.edu>) wrote:
>
>
>
> Hello
>
>    We are moving from Univa (SGE) to Slurm, and one of our users has jobs
> that, if they detect a failure on the current machine, add that machine
> to their exclude list and requeue themselves. The user wants to emulate
> that behavior in Slurm.
>
>
>
> It seems like "scontrol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList"
> won’t work on a running job, but it does work on a job pending in the queue.
> This means the job can’t do this step and requeue itself to avoid running on
> the same host as before.
>
>
>
> Our user wants his jobs to be able to exclude the current node and requeue
> themselves.
>
> Is there some way to accomplish this in slurm?
>
> Is there a requeue counter of some sort so a job can see if it has
> requeued itself more than X times and give up?
>
>
>
> Thanks.
>
>