[slurm-users] Change ExcNodeList on a running job

Ransom, Geoffrey M. Geoffrey.Ransom at jhuapl.edu
Thu Jun 4 20:00:34 UTC 2020


Hello,
   We are moving from Univa (SGE) to Slurm, and one of our users has jobs that, when they detect a failure on the current machine, add that machine to their exclude list and requeue themselves. The user wants to emulate that behavior in Slurm.

It seems that "scontrol update JobId=${SLURM_JOB_ID} ExcNodeList=$NEWExcNodeList" does not work on a running job, although it does work on a job pending in the queue. This means a running job cannot update its own exclude list and then requeue itself to avoid landing on the same host as before.
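To make the step concrete, here is a minimal sketch of what the job would need to do: append the current node to its exclude list and issue the update using scontrol's Name=Value form. The helper names are hypothetical, and the second helper only prints the command rather than running it, since the update is the part that does not take effect on a running job.

```shell
#!/bin/bash
# Hypothetical helper: append a node name to a comma-separated exclude list.
# An empty current list yields just the node name.
append_excluded_node() {
    local current_list=$1 node=$2
    if [ -z "$current_list" ]; then
        printf '%s\n' "$node"
    else
        printf '%s\n' "${current_list},${node}"
    fi
}

# Hypothetical helper: print (not run) the update command for a job.
print_exclude_update_cmd() {
    local job_id=$1 exclude_list=$2
    printf 'scontrol update JobId=%s ExcNodeList=%s\n' "$job_id" "$exclude_list"
}
```

Inside a batch script this might be called as print_exclude_update_cmd "$SLURM_JOB_ID" "$(append_excluded_node "$OLD_LIST" "$SLURMD_NODENAME")", where OLD_LIST is an assumed variable holding the job's existing exclude list and SLURMD_NODENAME is the node running the batch script.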

Our user wants his jobs to be able to exclude the current node and requeue themselves.
Is there some way to accomplish this in Slurm?
Is there a requeue counter of some sort, so that a job can see whether it has requeued itself more than X times and give up?
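
On the counter question, one possibility is the SLURM_RESTART_COUNT environment variable, which the sbatch documentation describes as being set to the number of restarts once a job has been restarted or explicitly requeued. The sketch below assumes the variable is unset on the first run; the function name and the limit are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed (return 0) when the job has already been
# restarted at least max_requeues times.
# Assumption: SLURM_RESTART_COUNT is unset on the first run, so treat it as 0.
requeue_limit_reached() {
    local max_requeues=$1
    local count=${SLURM_RESTART_COUNT:-0}
    [ "$count" -ge "$max_requeues" ]
}
```

A job script could then check "if requeue_limit_reached 3; then exit 1; fi" before calling "scontrol requeue $SLURM_JOB_ID", assuming the job was submitted as requeueable.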

Thanks.