[slurm-users] Restarting jobs
nicolas.sonoda at versatushpc.com.br
Fri Aug 19 17:37:23 UTC 2022
Thank you very much for the explanation!
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Paul Brunk <pbrunk at uga.edu>
Sent: Friday, August 19, 2022 09:23
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Restarting jobs
In Slurm lingo this is "job requeueing". The JobRequeue
slurm.conf parameter controls whether Slurm tries to start those
jobs again (requeue vs. job exit).
The slurm.conf doc puts it nicely:
This option controls the default ability for batch jobs to be
requeued. Jobs may be requeued explicitly by a system
administrator, after node failure, or upon preemption by a
higher priority job. If JobRequeue is set to a value of 1, then
batch jobs may be requeued unless explicitly disabled by the
user. If JobRequeue is set to a value of 0, then batch jobs will
not be requeued unless explicitly enabled by the user. Use the
sbatch --no-requeue or --requeue option to change the default
behavior for individual jobs. The default value is 1.
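
As a rough sketch of what that could look like for your case (my own illustration, not part of the doc text quoted above, and untested against your site's config), turning off requeueing by default would be a one-line change in slurm.conf:

```
# slurm.conf fragment (sketch): do not requeue batch jobs by default.
# With this set, a job only becomes eligible for requeue if the user
# asks for it with "sbatch --requeue".
JobRequeue=0
```

I would expect you to need "scontrol reconfigure" (or a slurmctld restart) for the change to be picked up, and I'd assume it only affects jobs submitted afterwards, so please check that on your cluster. If you would rather leave the cluster default alone, submitting with "sbatch --no-requeue" does the same thing per job. "scontrol show job <jobid>" prints a Requeue= field, which is a quick way to see what a given job actually got.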
Paul Brunk, system administrator
Advanced Computing Resource Center
Enterprise IT Svcs, the University of Georgia
On 8/18/22, 1:57 PM, "slurm-users" <slurm-users-bounces at lists.schedmd.com> wrote:
This week my machines rebooted, and the jobs that were running restarted, so I lost the progress they had made. Can I prevent jobs from restarting like that? For example, if my machines reboot, I would want the jobs to be cancelled instead.