[slurm-users] Change ExcNodeList on a running job
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Jun 5 07:41:18 UTC 2020
Hi Geoffrey,
I'm just curious as to what causes a user to decide that a given node
has an issue? If a node is healthy in all respects, why would a user
decide not to use the node?
We can certainly perform all sorts of node health checks from Slurm by
configuring the use of LBNL Node Health Check[1]. Items such as disk
full, network interface down, memory removed, etc. can be checked for.
Slurm will offline any node that fails a NHC check, and no jobs will be
started on that node until the condition has been cleared.
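As a rough illustration, the slurm.conf pieces that hook in NHC might look
like this (the path and interval are site-specific placeholders):

   HealthCheckProgram=/usr/sbin/nhc
   HealthCheckInterval=300
   HealthCheckNodeState=ANY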
I have some suggestions about NHC usage in my Slurm Wiki:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check
Best regards,
Ole
[1] https://github.com/mej/nhc
On 04-06-2020 23:37, Ransom, Geoffrey M. wrote:
> Not quite.
>
> The user’s job script in question checks the exit status of the
> program it runs. If the program fails, the running job wants to exclude
> the machine it is currently running on and requeue itself, in case the
> failure was due to a local machine issue that the scheduler has not
> flagged as a problem.
>
> The current goal is to have a running job step in an array job add the
> current host to its exclude list and requeue itself when it detects a
> problem. I can’t seem to modify the exclude list while a job is running,
> but once the task is requeued and back in the queue it is no longer
> running, so it can’t modify its own exclude list.
>
> I.e., put something like the following into an sbatch script so each
> task can run it against itself:
>
> if ! $runprogram $args ; then
>     NewExcNodeList="${ExcNodeList},${HOSTNAME}"
>     scontrol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList
>     scontrol requeue ${SLURM_JOB_ID}
>     sleep 10
> fi
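>
> (As written, this assumes $ExcNodeList already holds the job’s current
> exclusion list, which as far as I know Slurm does not export as an
> environment variable. A rough sketch of one way to populate it inside the
> batch script, by parsing the ExcNodeList field that scontrol show job
> prints, might be:
>
> ExcNodeList=$(scontrol show job ${SLURM_JOB_ID} | tr ' ' '\n' |
>     awk -F= '/^ExcNodeList=/{print $2}')
> # scontrol prints "(null)" when no nodes are excluded yet
> [ "$ExcNodeList" = "(null)" ] && ExcNodeList=""
>
> The exact parsing is illustrative only.)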
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of*
> Rodrigo Santibáñez
> *Sent:* Thursday, June 4, 2020 4:16 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [EXT] Re: [slurm-users] Change ExcNodeList on a running job
>
>
> Hello,
>
> Jobs can be requeued if something goes wrong, and the node with the
> failure excluded by the controller.
>
> *--requeue*
>
> Specifies that the batch job should be eligible for requeuing. The job
> may be requeued explicitly by a system administrator, after node
> failure, or upon preemption by a higher priority job. When a job is
> requeued, the batch script is initiated from its beginning. Also see the
> *--no-requeue* option. The /JobRequeue/ configuration parameter controls
> the default behavior on the cluster.
>
> Also, jobs can be run selecting a specific node or excluding nodes
>
> *-w*, *--nodelist*=<node name list>
>
> Request a specific list of hosts. The job will contain /all/ of these
> hosts and possibly additional hosts as needed to satisfy resource
> requirements. The list may be specified as a comma-separated list of
> hosts, a range of hosts (host[1-5,7,...] for example), or a filename.
> The host list will be assumed to be a filename if it contains a "/"
> character. If you specify a minimum node or processor count larger than
> can be satisfied by the supplied host list, additional resources will be
> allocated on other nodes as needed. Duplicate node names in the list
> will be ignored. The order of the node names in the list is not
> important; the node names will be sorted by Slurm.
>
> *-x*, *--exclude*=<node name list>
>
> Explicitly exclude certain nodes from the resources granted to the job.
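>
> As a rough sketch of how these options fit together in a batch script
> (the node names and program are only placeholders):
>
> #!/bin/bash
> #SBATCH --requeue
> #SBATCH --exclude=node[001-002]
> srun ./my_program
>
> The exclude list given this way is fixed at submission time.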
>
> does this help?
>
> On Thu, Jun 4, 2020 at 16:03, Ransom, Geoffrey M.
> (Geoffrey.Ransom at jhuapl.edu) wrote:
>
> Hello
>
> We are moving from Univa (SGE) to Slurm, and one of our users has
> jobs that, if they detect a failure on the current machine, add that
> machine to their exclude list and requeue themselves. The user
> wants to emulate that behavior in Slurm.
>
> It seems like “scontrol update job ${SLURM_JOB_ID}
> ExcNodeList=$NEWExcNodeList” won’t work on a running job, but it does
> work on a job pending in the queue. This means the job can’t do this
> step and requeue itself to avoid running on the same host as before.
>
> Our user wants his jobs to be able to exclude the current node and
> requeue themselves.
>
> Is there some way to accomplish this in slurm?
>
> Is there a requeue counter of some sort so a job can see if it has
> requeued itself more than X times and give up?
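Regarding the requeue counter: a minimal sketch, assuming the
SLURM_RESTART_COUNT environment variable (described in the sbatch man page
for restarted/requeued jobs) is set in the job’s environment, could be:

   if [ "${SLURM_RESTART_COUNT:-0}" -ge 3 ]; then
       echo "Giving up after ${SLURM_RESTART_COUNT} requeues" >&2
       exit 1
   fi

The limit of 3 is arbitrary.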